How to multiply polynomials in Θ(n log n) time

The information in this post is taken from the sources listed in the Sources section at the end of it. If you don’t care about understanding the algorithm and need an implementation now, skip to the Implementation subsection.

Why?

Why would anyone want to multiply polynomials in $\Theta(n \log n)$ time instead of the naive $\Theta(n^2)$? Sometimes it comes up. It happened to me when I was confronted with the UVa Online Judge 12879 Golf Bot problem. I knew I needed a discrete convolution, but then I got stuck: $200\,000^2$ operations is way too many, and the only way I knew to compute a convolution was quadratic.

If you mean “Why would you write this post?”, well, talking about an algorithm I recently learned to solve a problem is proving to be an easy way to kick-start my blog.

Gathering building blocks

So how do we do it? We are going to have to talk a bit about the representation of polynomials in computer memory first. (Haha get it?)

Coefficient representation

As far as I know, there are two useful ways of doing that. The first, and most obvious, is to store the coefficients of every term in an array, vector, list, you name it. Let $A(x)$ be a polynomial of degree-bound $n$:

$$A(x) = a_0 + a_1x + a_2x^2 + ... + a_{n-1}x^{n-1}$$

The “degree-bound $n$” is any $n$ strictly greater than the degree of the polynomial. We will always be talking about the smallest $n$ that fulfills this condition, that is, the degree plus one. The polynomial’s coefficients are $a_0, a_1, …, a_{n-1}$, and that’s what we store. For example, the polynomial:

$$3 + 2x^2 + 5x^3$$

has the representation $3, 0, 2, 5$. Notice there is no "$x = x^1$" term, so the coefficient in position 1 is 0.

There is a straightforward way to evaluate the polynomial at any point $x_0$ in $\Theta(n)$ time: raise $x_0$ to the power matching each term, multiply by the corresponding coefficient, and add everything up. I prefer Horner's rule though:

$$A(x_0) = a_0 + x_0(a_1 + x_0(\dots + x_0(a_{n-1})))$$
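As a minimal C++ sketch of Horner's rule (function and variable names are my own):

```cpp
#include <vector>

// Evaluate A(x0) for a polynomial given by its coefficients
// a[0], a[1], ..., a[n-1], where a[i] is the coefficient of x^i.
// Horner's rule: start from the highest coefficient, then repeatedly
// multiply by x0 and add the next coefficient down.
double horner(const std::vector<double>& a, double x0) {
    double result = 0.0;
    for (int i = (int)a.size() - 1; i >= 0; --i)
        result = result * x0 + a[i];
    return result;
}
```

For example, evaluating $3 + 2x^2 + 5x^3$ (stored as `{3, 0, 2, 5}`) at $x_0 = 2$ gives $3 + 8 + 40 = 51$.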

Multiplying two polynomials of degree-bound $n$ and $m$ takes $\Theta(nm)$ time: every coefficient of one has to be multiplied with every coefficient of the other. The resulting polynomial is of degree-bound $n + m - 1$. You’ve probably done this by hand countless times when solving or simplifying equations.

Point-value representation

To represent a polynomial $A(x)$ of degree-bound $n$, we store $n$ point-value pairs:

$$(x_0, A(x_0)), (x_1, A(x_1)), \; \dots, \; (x_{n-1}, A(x_{n-1}))$$

The algorithm to multiply polynomials in point-value representation is as follows. Let $A(x)$ and $B(x)$ be polynomials of degree-bound $n$ and $m$ respectively. Both are also of degree-bound $n+m-1$. (Proof: $n+m-1 \geq n$ and $n+m-1 \geq m$ whenever $n, m \geq 1$, which is always the case for degree-bounds.) Represent $A(x)$ and $B(x)$ as vectors of $n+m-1$ point-value pairs, evaluated at the same points. Then multiply the values of each corresponding point-value pair.

Intuitively, because the value at each point of the result is the multiplication of the values at each point of the two polynomials we are multiplying, that is exactly what we do!

Let us do an example. We multiply:

$$A(x) = 3x^2 + x$$
$$B(x) = -2x^3 + 2x^2 - x + 4$$

Their minimum degree-bounds are, respectively, $3$ and $4$. Thus, we need $3+4-1=6$ point-value pairs for each:

| $x_0$ | $A(x_0)$ | $B(x_0)$ | $C(x_0) = A(x_0) \cdot B(x_0)$ |
|-------|----------|----------|--------------------------------|
| 0 | $A(0) = 0$ | $B(0) = 4$ | $C(0) = 0$ |
| 1 | $A(1) = 4$ | $B(1) = 3$ | $C(1) = 12$ |
| 2 | $A(2) = 14$ | $B(2) = -6$ | $C(2) = -84$ |
| 3 | $A(3) = 30$ | $B(3) = -35$ | $C(3) = -1050$ |
| 4 | $A(4) = 52$ | $B(4) = -96$ | $C(4) = -4992$ |
| 5 | $A(5) = 80$ | $B(5) = -201$ | $C(5) = -16080$ |

$C(x)$ on the right is the result of $A(x) \cdot B(x)$, and its point-value representation is $(0, 0), (1, 12), (2, -84), (3, -1050), (4, -4992), (5, -16080)$.

“That’s it!”, you may think. “Problem solved, and it’s $\Theta(n)$!” Not so fast.

To convert from coefficient to point-value representation we need to evaluate $A(x)$ at $n$ points. Doing that with what we know, Horner’s rule, takes $\Theta(n)$ per point, so $\Theta(n^2)$ in total.

Cutting time with FFT (Fast Fourier Transform)

FFT is a very important algorithm in signal analysis. It computes the DFT (Discrete Fourier Transform) of a signal in $\Theta(n \log n)$ time. The DFT maps the time domain to the frequency domain; that is, from a sequence of samples it extracts which frequencies are present and with what intensity. It is used, for example, to measure the pitch of a sound input, in applications such as Learn by Music Technology Group or one of those singing games.

But we are more interested in the mathematical definition of the DFT.

$$X_k = \sum_{n=0}^{N-1} x_n \cdot e^{-j2\pi k n / N}$$

Let’s break this definition down and rearrange it to understand why it is relevant to us. We also change the variables $A_k = X_k$, $a = x$, $i = n$, to reflect the correspondences with the naming conventions we used for polynomials.

$$A_k = \sum_{i=0}^{N-1} a_i \cdot \left( e^{-j2\pi k/N} \right)^i$$

For each $k$, the DFT is the sum of all the input coefficients $a_i$ multiplied by $\left(x_k\right)^i = \left( e^{-j2\pi k/N} \right)^i$. By doing this, we are effectively evaluating the polynomial with coefficients $a_i$ at all the points $x_k = e^{-j2\pi k/N}$.

We can recover the coefficient representation from a point-value representation at the points $x_k = e^{-j2\pi k/N}$ by using the inverse DFT. It turns out this is quite the simple reversal:

$$a_i = \frac{1}{N} \sum_{k=0}^{N-1} A_k \cdot \left( e^{j2\pi k/N} \right)^i$$

You can look up the proof elsewhere.
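To see that the two formulas really invert each other, here is a naive $\Theta(n^2)$ DFT/inverse-DFT pair straight from the definitions (my own sketch using `std::complex`; not fast, just a sanity check):

```cpp
#include <complex>
#include <vector>

// Naive DFT from the definition: evaluates the polynomial with
// coefficients `a` at the points e^{-j 2 pi k / N}, k = 0..N-1.
// dir = -1 gives the forward transform's exponent sign; dir = +1,
// together with the final division by N, gives the inverse.
std::vector<std::complex<double>>
dft(const std::vector<std::complex<double>>& a, int dir) {
    const double PI = 3.14159265358979323846;
    size_t N = a.size();
    std::vector<std::complex<double>> out(N);
    for (size_t k = 0; k < N; ++k)
        for (size_t i = 0; i < N; ++i)
            out[k] += a[i] * std::polar(1.0, dir * 2.0 * PI * k * i / N);
    if (dir == 1)                         // inverse: scale by 1/N
        for (auto& v : out) v /= (double)N;
    return out;
}
```

Running `dft(dft(a, -1), 1)` returns the original coefficients (up to floating-point error), which is exactly the round trip the algorithm below relies on.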

The Algorithm

The building blocks of our algorithm are complete.

Input: $A_v$ and $B_v$, coefficient representation of polynomials $A(x)$ and $B(x)$ respectively.

Output: $C_v$, coefficient representation of polynomial $C(x) = A(x) \cdot B(x)$.

  1. Let $\left| A_v \right|$ be the length of $A_v$, that is, the lowest degree-bound of $A(x)$. Let $\left| B_v \right|$ be the length of $B_v$. Let $\left| C_v \right| = \left| A_v \right| + \left| B_v \right| - 1$ be the lowest degree-bound of the result $C_v$.

  2. Let $n = \min_i 2^i$ subject to $2^i \geq \left| C_v \right|$, that is, the smallest power of two greater than or equal to $\left| C_v \right|$. To compute it, iterate from $i = 0$, increasing by $1$ each time, until $2^i \geq \left| C_v \right|$.

  3. Let $A_v$ be the vector representing $A(x)$ padded with zeroes so that its total length is $n$. Let $B_v$ be the vector representing $B(x)$ padded with zeroes so that its total length is $n$.

  4. Let $A_v’ = FFT(A_v)$ and $B_v’ = FFT(B_v)$.

  5. Let $C_v’ = A_v’ \cdot B_v’$, the element-wise product of $A_v’$ and $B_v’$. That is, multiply every element of $A_v’$ with the element in the same position of $B_v’$, and assign the result to the same position in $C_v’$.

  6. Let $C_v = FFT^{-1}(C_v’)$. $C_v$ is $C(x)$ in coefficient representation. Terminate.
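The six steps can be sketched end to end in C++. This uses a plain recursive radix-2 FFT of my own (not the Stanford routine referenced later), so treat it as an illustration, not contest-ready code:

```cpp
#include <cmath>
#include <complex>
#include <vector>

typedef std::complex<double> cd;

// Recursive radix-2 Cooley-Tukey FFT; a.size() must be a power of two.
// dir = -1: forward transform; dir = +1: inverse (without 1/n scaling).
void fft(std::vector<cd>& a, int dir) {
    size_t n = a.size();
    if (n == 1) return;
    std::vector<cd> even(n / 2), odd(n / 2);
    for (size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fft(even, dir);
    fft(odd, dir);
    const double PI = 3.14159265358979323846;
    for (size_t k = 0; k < n / 2; ++k) {
        cd w = std::polar(1.0, dir * 2.0 * PI * k / n);
        a[k]         = even[k] + w * odd[k];
        a[k + n / 2] = even[k] - w * odd[k];
    }
}

// Steps 1-6 of the algorithm above, for integer coefficients.
std::vector<long long> multiply(std::vector<cd> A, std::vector<cd> B) {
    size_t bound = A.size() + B.size() - 1;      // step 1
    size_t n = 1;
    while (n < bound) n *= 2;                    // step 2
    A.resize(n); B.resize(n);                    // step 3: pad with zeroes
    fft(A, -1); fft(B, -1);                      // step 4
    for (size_t i = 0; i < n; ++i) A[i] *= B[i]; // step 5
    fft(A, 1);                                   // step 6: inverse FFT...
    std::vector<long long> C(bound);
    for (size_t i = 0; i < bound; ++i)           // ...including the 1/n factor
        C[i] = std::llround(A[i].real() / (double)n);
    return C;
}
```

Rounding the real parts at the end recovers exact integer coefficients as long as they are small enough for the floating-point error to stay below $0.5$.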

Analysis

We show, as formally as I can, why this algorithm is $\Theta(n \log n)$. You may skip this section if it’s obvious to you, or try to figure it out yourself before reading on.

  1. is $\Theta(1)$ if we know the length of the inputs. It may be up to $O(n)$ if we have to measure them, for example by iterating until we find a sentinel at the end.

  2. is $\Theta(\log n)$. Proof: the loop runs $i$ times, and since $n = 2^i$, we have $i = \log n$.

  3. is $\Theta(n)$. We have to append $\left( n - \left| A_v \right| \right) + \left( n - \left| B_v \right| \right)$ zeroes, which is a linear function of $n$.

  4. is $\Theta (n \log n)$, because the FFT is too.

  5. is $\Theta(n)$, because we iterate once for each element of $C_v’$.

  6. is $\Theta (n \log n)$. The inverse FFT is a regular FFT followed by dividing every element by $N$. Thus it’s $\Theta(n \log n + n) = \Theta(n \log n)$.

Thus, the whole algorithm is:

$$\Theta(n + \log n + n + n \log n + n + n \log n) = \Theta(n \log n)$$

Isn’t it easy to simplify expressions in asymptotic notation?

Implementation

If you are writing an application that needs an FFT of any form, I suggest using a library. Numerous smart people have spent their time making said library bug-free and performant, and you can greatly profit from their effort. I only know of one FFT library, the Fastest Fourier Transform in the West, but there are probably others that are just as fast, or bindings for this one to other languages.

However, if you must embed the routine in your program, for instance in a programming contest, you have to “implement it yourself”. The algorithm is not very complicated, but time spent debugging it is time wasted. I suggest copying Stanford University ACM Team’s Notebook. I endorse copying many things from there, actually. At the very least, their implementation is clear enough that you can base yours off of it.

Assuming we have included Stanford’s implementation earlier in the C++ file:

// Assume we have at most 2^18 = 262144 coefficients
const int MAX_LEN = 262144 * 2;

// For those of you who don't want to read the Stanford source:
// "cpx" represents a complex number.
// FFT(input, output, internal=1, size of input, direction)
// Direction is 1 for FFT, -1 for inverse FFT

cpx A[MAX_LEN], B[MAX_LEN], C[MAX_LEN];
int A_len, B_len, C_len;

/* set the appropriate coefficients in the inputs A and B's real-valued part,
 * and their length in A_len and B_len.
 */

// find the smallest power of two >= A_len + B_len - 1
for(C_len = 1; C_len < A_len + B_len - 1; C_len *= 2);

// This is assumed in the 12879 Golf Bot problem
assert(C_len <= MAX_LEN);

// pad A and B with zeroes
memset(A + A_len, 0, (C_len - A_len) * sizeof(cpx));
memset(B + B_len, 0, (C_len - B_len) * sizeof(cpx));

// fill C with A's point-value
FFT(A, C, 1, C_len, 1);

// fill A with B's point-value
FFT(B, A, 1, C_len, 1);

// Multiply the polynomials in point-value
for(int i=0; i<C_len; i++)
    A[i] = A[i] * C[i];

// Convert back to coefficient representation
FFT(A, C, 1, C_len, -1);
for(int i=0; i<C_len; i++)
    C[i].a /= C_len;

// now C[i].a (the real-valued parts) contain the result

Solving 12879 Golf Bot

To solve 12879 Golf Bot, we need a vector that is 1 at the distances the golf bot can shoot and 0 at the distances it cannot. Convolving that vector with itself tells us, for every distance, in how many ways it can be reached with two shots. This operation, the discrete convolution, is exactly the multiplication of two polynomials in coefficient form. So we do just that, except we don’t even need to divide by C_len at the end, because all we care about is whether a given position is $> 0$.

It’s a very quick explanation, but the problem isn’t the focus of this post :)

Final thoughts

Hope this post was enjoyable, easy to read, and useful.

Don’t use this in production over the simple naive quadratic algorithm if you do not handle big enough inputs. How big is big enough? Measure.

Sources

This pdf from Portland State University’s Algorithm Design and Analysis class. Contains more information about operations, properties and algorithms related to polynomials.

Wikipedia’s definition of the DFT and inverse DFT.

A Stack Exchange answer with an example of this algorithm. Helped me understand this a lot.