They are directly connected to the convolution operation and Fast Fourier Transform.
One of the key operation in signal processing/machine learning is the convolution of two functions.
Let $x(t)$ and $y(t)$ be two given functions. Their convolution is defined as
A well-known fact: a convolution in the time domain is a product in the frequency domain.
Let us approximate the integral by a quadrature sum on a uniform grid, and store the signal at equidistant points.
Then we are left with the summation
$$z_i = \sum_{j=0}^{n-1} x_j y_{i - j},$$which is called discrete convolution. This can be thought as an application of a filter with coefficients $x$ to a signal $y$.
There are different possible filters for different purposes, but they all utilize the shift-invariant structure.
A discrete convolution can be thought as a matrix-by-vector product:
$$z_i = \sum_{j=0}^{n-1} x_j y_{i - j}, \Leftrightarrow z = Ax$$where the matrix $A$ elements are given as $a_{ij} = y_{i-j}$, i.e., they depend only on the difference between the row index and the column index.
A matrix is called Toeplitz if its elements are defined as
$$a_{ij} = t_{i - j}.$$A Toeplitz matrix is completely defined by its first column and first row (i.e., $2n-1$ parameters).
It is a dense matrix, however it is a structured matrix (i.e., defined by $\mathcal{O}(n)$ parameters).
And the main operation in the discrete convolution is the product of Toeplitz matrix by vector.
Can we compute it faster than $\mathcal{O}(n^2)$?
For a special class of Toeplitz matrices, named circulant matrices the fast matrix-by-vector product can be done.
A matrix $C$ is called circulant, if
i.e. it periodicaly wraps
$$C = \begin{bmatrix} c_0 & c_1 & c_2 & c_3 \\ c_3 & c_0 & c_1 & c_2 \\ c_2 & c_3 & c_0 & c_1 \\ c_1 & c_2 & c_3 & c_0 \\ \end{bmatrix}. $$Theorem:
Any circulant matrix can be represented in the form
$$C = \frac{1}{n} F^* \Lambda F,$$where $F$ is the Fourier matrix with the elements
$$F_{kl} = w_n^{kl}, \quad k, l = 0, \ldots, n-1, \quad w_n = e^{-\frac{2 \pi i}{n}},$$and matrix $\Lambda = \text{diag}(\lambda)$ is the diagonal matrix and
$$\lambda = F c, $$where $c$ is the first column of the circulant matrix $C$.
The proof will be later: now we need to study the FFT matrix.
The Fourier matrix is defined as:
$$ F_n = \begin{pmatrix} 1 & 1 & 1 & \dots & 1 \\ 1 & w^{1\cdot 1}_n & w^{1\cdot 2}_n & \dots & w^{1\cdot (n-1)}_n\\ 1 & w^{2\cdot 1}_n & w^{2\cdot 2}_n & \dots & w^{2\cdot (n-1)}_n\\ \dots & \dots & \dots &\dots &\dots \\ 1 & w^{(n-1)\cdot 1}_n & w^{(n-1)\cdot 2}_n & \dots & w^{(n-1)\cdot (n-1)}_n\\ \end{pmatrix}, $$or equivalently
$$ F_n = \{ w_n^{kl} \}_{k,l=0}^{n-1}, $$where
$$w_n = e^{-\frac{2\pi i}{n}}.$$Properties:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
N = 1000
dt = 1.0 / 800.0
x = np.linspace(0.0, N*dt, N)
y = np.sin(50.0 * 2.0*np.pi*x) + 0.5*np.sin(80.0 * 2.0*np.pi*x) + 0.2*np.sin(300.0 * 2.0*np.pi*x)
plt.plot(x, y)
plt.xlabel('Time')
plt.ylabel('Signal')
plt.title('Initial signal')
yf = np.fft.fft(y)
xf = np.linspace(0.0, 1.0/(2.0*dt), N//2)
plt.plot(xf, 2.0/N * np.abs(yf[0:N//2])) #Note: N/2 to N will give negative frequencies
plt.xlabel('Frequency')
plt.ylabel('Amplitude')
plt.title('Discrete Fourier transform')
Here we consider a matrix interpretation of the standard Cooley-Tukey algorithm (1965), which has underlying divide and conquer idea. Note that in packages more advanced versions are used.
First of all we permute the rows of the Fourier matrix such that the first $n/2$ rows of the new matrix had row numbers $1,3,5,\dots,n-1$ and the last $n/2$ rows had row numbers $2,4,6\dots,n$.
This permutation can be expressed in terms of multiplication by permutation matrix $P_n$:
Hence,
$$ P_n F_n = \begin{pmatrix} 1 & 1 & 1 & \dots & 1 \\ 1 & w^{2\cdot 1}_n & w^{2\cdot 2}_n & \dots & w^{2\cdot (n-1)}_n\\ 1 & w^{4\cdot 1}_n & w^{4\cdot 2}_n & \dots & w^{4\cdot (n-1)}_n\\ \vdots & & & & \vdots\\ 1 & w^{(n-2)\cdot 1}_n & w^{(n-2)\cdot 2}_n & \dots & w^{(n-2)\cdot (n-1)}_n\\ \hline 1 & w^{1\cdot 1}_n & w^{1\cdot 2}_n & \dots & w^{1\cdot (n-1)}_n\\ 1 & w^{3\cdot 1}_n & w^{3\cdot 2}_n & \dots & w^{3\cdot (n-1)}_n\\ \vdots & & & & \vdots\\ 1 & w^{(n-1)\cdot 1}_n & w^{(n-1)\cdot 2}_n & \dots & w^{(n-1)\cdot (n-1)}_n\\ \end{pmatrix}, $$Now let us imagine that we separated its columns and rows by two parts each of size $n/2$.
As a result we get $2\times 2$ block matrix that has the following form
$$ P_n F_n = \begin{pmatrix} \left\{w^{2kl}_n\right\} & \left\{w_n^{2k\left(\frac{n}{2} + l\right)}\right\} \\ \left\{w_n^{(2k+1)l}\right\} & \left\{w_n^{(2k+1)\left(\frac{n}{2} + l\right)}\right\} \end{pmatrix}, \quad k,l = 0,\dots, \frac{n}{2}-1. $$So far it does not look like something that works faster :) But we will see that in a minute. Lets have a more precise look at the first block $\left\{w^{2kl}_n\right\}$:
$$ w^{2kl}_n = e^{-2kl\frac{2\pi i}{n}} = e^{-kl\frac{2\pi i}{n/2}} = w^{kl}_{n/2}. $$So this block is exactly twice smaller Fourier matrix $F_{n/2}$!
<!--- Now we can write
$$ \begin{pmatrix} F_{n/2} & \left\{w_n^{2k\left(\frac{n}{2} + l\right)}\right\} \\ \left\{w_n^{(2k+1)l}\right\} & \left\{w_n^{(2k+1)\left(\frac{n}{2} + l\right)}\right\} \end{pmatrix} $$--> The block $\left\{w_n^{(2k+1)l}\right\}$ can be written as
$$ w_n^{(2k+1)l} = w_n^{2kl + l} = w_n^{l} w_n^{2kl} = w_n^{l} w_{n/2}^{kl}, $$which can be written as $W_{n/2}F_{n/2}$, where
$$W_{n/2} = \text{diag}(1,w_n,w_n^2,\dots,w_n^{n/2-1}).$$Doing the same tricks for the other blocks we will finally get
$$ P_n F_n = \begin{pmatrix} F_{n/2} & F_{n/2} \\ F_{n/2}W_{n/2} & -F_{n/2}W_{n/2} \end{pmatrix} = \begin{pmatrix} F_{n/2} & 0 \\ 0 & F_{n/2} \end{pmatrix} \begin{pmatrix} I_{n/2} & I_{n/2} \\ W_{n/2} & -W_{n/2} \end{pmatrix}. $$#FFT vs full matvec
import time
import numpy as np
import scipy as sp
import scipy.linalg
n = 10000
F = sp.linalg.dft(n)
x = np.random.randn(n)
y_full = F.dot(x)
full_mv_time = %timeit -q -o F.dot(x)
print('Full matvec time =', full_mv_time.average)
y_fft = np.fft.fft(x)
fft_mv_time = %timeit -q -o np.fft.fft(x)
print('FFT time =', fft_mv_time.average)
print('Relative error =', (np.linalg.norm(y_full - y_fft)) / np.linalg.norm(y_full))
FFT helps to multiply fast by certain types of matrices. We start from a circulant matrix:
$$ C = \begin{pmatrix} c_0 & c_{n-1} & c_{n-2} & \dots & c_1 \\ c_{1} & c_{0} & c_{n-1} & \dots & c_2 \\ c_{2} & c_{1} & c_0 & \dots & c_3 \\ \dots & \dots & \dots & \dots & \dots \\ c_{n-1} & c_{n-2} & c_{n-3} & \dots & c_0 \end{pmatrix} $$Theorem. Let $C$ be a circulant matrix of size $n\times n$ and let $c$ be it's first column , then
$$ C = \frac{1}{n} F_n^* \text{diag}(F_n c) F_n $$Proof.
where $\omega$ is any number such that $\omega^n=1$.
and finally
$$ C = \frac{1}{n} F^*_n \Lambda F_n, \quad \text{where}\quad \Lambda = \text{diag}(F_nc) \qquad\blacksquare $$where $\circ$ denotes elementwise product (Hadamard product) of two vectors (since $\text{diag}(a)b = a\circ b$) and ifft
denotes inverse Fourier transform $F^{-1}_n$.
import numpy as np
import scipy as sp
import scipy.linalg
def circulant_matvec(c, x):
return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x))
n = 5000
c = np.random.random(n)
C = sp.linalg.circulant(c)
x = np.random.randn(n)
y_full = C.dot(x)
full_mv_time = %timeit -q -o C.dot(x)
print('Full matvec time =', full_mv_time.average)
y_fft = circulant_matvec(c, x)
fft_mv_time = %timeit -q -o circulant_matvec(c, x)
print('FFT time =', fft_mv_time.average)
print('Relative error =', (np.linalg.norm(y_full - y_fft)) / np.linalg.norm(y_full))
Now we get back to Toeplitz matrices!
$$ T = \begin{pmatrix} t_0 & t_{-1} & t_{-2} & t_{-3}& \dots & t_{1-n} \\ t_{1} & t_{0} & t_{-1} & t_{-2}& \dots & t_{2-n} \\ t_{2} & t_{1} & t_0 & t_{-1} &\dots & t_{3-n} \\ t_{3} & t_{2} & t_1 & t_0 & \dots & t_{4-n} \\ \dots & \dots & \dots & \dots & \dots & \dots\\ t_{n-1} & t_{n-2} & t_{n-3} & t_{n-4} &\dots &t_0 \end{pmatrix}, $$or equivalently $T_{ij} = t_{i-j}$.
Matvec operation can be written as
$$ y_i = \sum_{j=1}^n t_{i-j} x_j, $$which can be interpreted as a discrete convolution of filter $t_i$ and signal $x_i$. For simplicity the size of the filter $t$ is such that the sizes of the input and output signals are the same. Generally, filter size can be arbitrary.
Fast convolution computation has a variety of applications, for instance, in signal processing or partial differential and integral equations. For instance, here is the smoothing of a signal:
from scipy import signal
%matplotlib inline
import matplotlib.pyplot as plt
alpha = 0.01
sig = np.repeat([0., 1., 0.], 100)
filt = np.exp(-alpha * (np.arange(100)-50)**2)
filtered = signal.convolve(sig, filt, mode='same') / sum(filt)
fig, (ax_orig, ax_filt, ax_filtered) = plt.subplots(3, 1, sharex=True)
ax_orig.plot(sig)
ax_orig.margins(0, 0.1)
ax_filt.plot(filt)
ax_filt.margins(0, 0.1)
ax_filtered.plot(filtered)
ax_filtered.margins(0, 0.1)
ax_orig.set_title('Original signal')
ax_filt.set_title('Filter')
ax_filtered.set_title('Convolution')
fig.tight_layout()
Key point: the multiplication by a Toeplitz matrix can be reduced to the multiplication by a circulant.
The 2-dimensional convolution is defined as
$$ y_{i_1i_2} = \sum_{j_1,j_2=1}^n t_{i_1-j_1, i_2-j_2} x_{j_1 j_2}. $$Note that $x$ and $y$ are 2-dimensional arrays and $T$ is 4-dimensional. To reduce this expression to matrix-by-vector product we have to reshape $x$ and $y$ into long vectors:
$$ \text{vec}(x) = \begin{pmatrix} x_{11} \\ \vdots \\ x_{1n} \\ \hline \\ \vdots \\ \hline \\ x_{n1} \\ \vdots \\ x_{nn} \end{pmatrix}, \quad \text{vec}(y) = \begin{pmatrix} y_{11} \\ \vdots \\ y_{1n} \\ \hline \\ \vdots \\ \hline \\ y_{n1} \\ \vdots \\ y_{nn} \end{pmatrix}. $$In this case matrix $T$ is block Toeplitz with Toeplitz blocks: (BTTB)
$$ T = \begin{pmatrix} T_0 & T_{-1} & T_{-2} & \dots & T_{1-n} \\ T_{1} & T_{0} & T_{-1} & \dots & T_{2-n} \\ T_{2} & T_{1} & T_0 & \dots & T_{3-n} \\ \dots & \dots & \dots & \dots & \dots\\ T_{n-1} & T_{n-2} & T_{n-3} &\dots &T_0 \end{pmatrix}, \quad \text{where} \quad T_k = t_{k, i_2 - j_2}\quad \text{are Toeplitz matrices} $$To get fast matvec we need to embed block Toeplitz matrix with Toeplitz blocks into the block circulant matrix with circulant blocks. The analog of $$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \star \\ \star \end{pmatrix} = \text{ifft}(\text{fft}(\begin{pmatrix} t_0 \\ t_{1} \\ t_{2} \\ t_{-2} \\ t_{-1} \end{pmatrix})\circ\text{fft}(\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ 0 \\ 0 \end{pmatrix})).$$ will look like $$ \begin{pmatrix} y_{11} & y_{12} & y_{13} & \star & \star \\ y_{21} & y_{22} & y_{23} & \star & \star \\ y_{31} & y_{32} & y_{33} & \star & \star \\ \star & \star & \star & \star & \star \\ \star & \star & \star & \star & \star \\ \end{pmatrix} = \text{ifft2d}(\text{fft2d}(\begin{pmatrix} t_{0,0} & t_{1,0} & t_{2,0} & t_{-2,0} & t_{-1,0} \\ t_{0,1} & t_{1,1} & t_{2,1} & t_{-2,1} & t_{-1,1} \\ t_{0,2} & t_{1,2} & t_{2,2} & t_{-2,2} & t_{-1,2} \\ t_{0,-2} & t_{1,-2} & t_{2,-2} & t_{-2,-2} & t_{-1,-2} \\ t_{0,-1} & t_{1,-1} & t_{2,-1} & t_{-2,-1} & t_{-1,-1} \end{pmatrix}) \circ \text{fft2d}(\begin{pmatrix}x_{11} & x_{12} & x_{13} & 0 & 0 \\ x_{21} & x_{22} & x_{23} & 0 & 0 \\ x_{31} & x_{32} & x_{33} & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ \end{pmatrix})),$$ where fft2d is 2-dimensional fft that consists of one-dimensional transforms, applied first to rows and and then to columns (or vice versa).
# Blurring and Sharpening Lena by convolution
from scipy import signal
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy import misc
import imageio
filter_size = 3
filter_blur = np.ones((filter_size, filter_size)) / filter_size**2
lena = imageio.imread('../files/lena512.jpg')
#lena = misc.face()
#lena = lena[:, :, 0]
blurred = signal.convolve2d(lena, filter_blur, boundary='symm', mode='same')
fig, ax = plt.subplots(2, 2, figsize=(8, 8))
ax[0, 0].imshow(lena[200:300, 200:300], cmap='gray')
ax[0, 0].set_title('Original Lena')
ax[0, 1].imshow(blurred[200:300, 200:300], cmap='gray')
ax[0, 1].set_title('Blurred Lena')
ax[1, 0].imshow((lena - blurred)[200:300, 200:300], cmap='gray')
ax[1, 0].set_title('Lena $-$ Blurred Lena')
ax[1, 1].imshow(((lena - blurred)*3 + blurred)[200:300, 200:300], cmap='gray')
ax[1, 1].set_title('$3\cdot$(Lena $-$ Blurred Lena) + Blurred Lena')
fig.tight_layout()
we have the spectral theorem
$$C = \frac{1}{n}F^* \Lambda F, \quad C^{-1} = \frac{1}{n}F^* \Lambda^{-1} F,$$but for a general Toeplitz matrices, it is not a trivial question.
Not-a-bad recipe for Toeplitz linear system is to use iterative method (fast matvec is available).
A good choice for a preconditioner is a circulant matrix.
A natural idea is to use circulants as preconditioners, since they are easy to invert.
The first preconditioner was the preconditioner by Raymond Chan and Gilbert Strang, who proposed to take the first column of the matrix and use it to generate the circulant.
The second preconditioner is the Tony Chan preconditioner, which is also very natural:
import numpy as np
import scipy.linalg
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
n = 100
c = np.zeros(n)
c[0] = -2
c[1] = 1
Tm = sp.linalg.toeplitz(c, c)
c1 = sp.linalg.circulant(c) #Strang preconditioner
Fmat = 1.0/np.sqrt(n) * np.fft.fft(np.eye(n)) #Poor man's Fourier matrix
d2 = np.diag(Fmat.conj().dot(Tm).dot(Fmat))
c2 = Fmat.dot(np.diag(d2)).dot(Fmat.conj().T)
mat = np.linalg.inv(c1).dot(Tm)
ev = np.linalg.eigvals(mat).real
plt.plot(np.sort(ev), np.ones(n), 'o')
plt.xlabel('Eigenvalues for Strang preconditioner')
plt.gca().get_yaxis().set_visible(False)
mat = np.linalg.inv(c2).dot(Tm)
ev = np.linalg.eigvals(mat).real
plt.figure()
plt.plot(np.sort(ev), np.ones(n), 'o')
plt.xlabel('Eigenvalues for T. Chan Preconditioner')
plt.gca().get_yaxis().set_visible(False)
The idea with preconditioners works for 2D/3D convolutions, but much worse.
In 1D case it is also possible to find direct solvers for the Toeplitz matrix, based on the structure of the inverse.
But the inverse of a Toeplitz matrix is not Toeplitz!
What to do?
The Toeplitz matrices belong to a more general class of low displacement rank matrices.
Define the scaled periodic shift matrix $Z_e$ that takes the vector $x$ and transforms it to a vector
Given a Toeplitz matrix $T$, we select any $e, f$ such that $ef \ne 1$ and define the displacement operator as
$$L(T) = Z_e T - T Z_f.$$For a Toeplitz matrix, $L(T)$ has rank 2 (it has only first row and last column non-zero)
import numpy as np
import scipy.linalg
n = 5
c = np.zeros(n)
c[0] = -2
c[1] = 1
T = sp.linalg.toeplitz(c, c)
e = 0.5
f = 0.5
def Z_shift(e):
return np.diag(np.ones(n-1), -1) + e * np.diag(np.ones(1), n-1)
Z1 = Z_shift(e)
Z2 = Z_shift(f)
print(Z1.dot(T) - T.dot(Z2))
What about the inverse?
It is also of rank $2$ !
import numpy as np
import scipy.linalg
n = 5
c = np.zeros(n)
c[0] = -2
c[1] = 1
T = sp.linalg.toeplitz(c, c)
e = 0.5
f = 0.5
def Z_shift(e):
return np.diag(np.ones(n-1), -1) + e * np.diag(np.ones(1), n-1)
Z1 = Z_shift(e)
Z2 = Z_shift(f)
Tinv = np.linalg.inv(T)
p1 = Z1.dot(Tinv) - Tinv.dot(Z2)
np.linalg.svd(p1)[1]
where $G$ is $n \times r$ and $H$ is $n \times r$.
Let $T$ satisfy
$$Z_e T - T Z_f = GH ^{\top},$$and let it be invertible.
Then we have
$$T^{-1} (Z_e T - T Z_f) T^{-1} = T^{-1} Z_e - Z_f T^{-1} = T^{-1} G H^{\top} T^{-1},$$i.e. the inverse has small displacement rank with the reversed pair of generators $Z_e, Z_f$ (why?).
Does the low-rank representation after the displacement operator describe the structure?
I.e. we need to solve the equation of the form
for a given right-hand side.
with $A$, $B$ and $C$ given.
In general, this is a linear system with $\mathcal{O}(n^2)$ unknowns, and the computational cost is expected to be $n^6$.
For Sylvester equation we can solve it in $\mathcal{O}(n^3)$ (will see in the next part on matrix equations)
But we can do better for specific $A, B$.
For the particular case, we have
$$Z_e T - T Z_f = GH^{\top} = B,$$and the solution is given by
$$ (e - f) T = \sum_{j = 1}^r Z_e(g_j) Z_f( J h_j), $$where $Z_e$ is the e-scaled circulant generated by the vector, and $g_j$, and $h_j$ are the columns of the matrices $G$ and $H$, and $J$ is the per-identity matrix (which has ones on the anti-diagonal).
Based on this idea and for the special case when $e = 0, f \rightarrow \inf$, we get the following famous formula for the inverse of the Toeplitz matrices as a sum of two products of triangular Toeplitz matrices:
Let $A$ be a Toeplitz matrix, and
$$A \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}=\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad A \begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \end{bmatrix}=\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix} $$then
$$A^{-1} = \frac{1}{x_0} \begin{bmatrix} x_0 & 0 & \ldots & 0 \\ x_1 & x_0 & 0 & \ldots \\ \ldots & \ldots \\ x_n & \ldots & \ldots & x_0 \end{bmatrix}\begin{bmatrix} u_0 & u_1 & \ldots & 0 \\ 0 & u_0 & u_1 & \ldots \\ \ldots & \ldots \\ 0 & \ldots & \ldots & u_0 \end{bmatrix}-\frac{1}{x_0} \begin{bmatrix} 0 & 0 & \ldots & 0 \\ y_0 & 0 & 0 & \ldots \\ y_1 & y_0 & \ldots \\ \ldots & \ldots & \\ y_{n-1} & \ldots & y_0 & 0 \end{bmatrix}\begin{bmatrix} 0 & v_0 & \ldots & 0 \\ 0 & 0 & v_0 & v_1 \\ \ldots & \ldots \\ \ldots & \ldots & \ldots & v_0 \\ 0 & \ldots & \ldots & 0\end{bmatrix},$$where $u_y = y_{n-i}, \quad v_i = x_{n-i}$.
The main meaning: the inverse can be recovered from the first column and the last column of the matrix.
For details, I refer to the book by Tyrtyshnikov "Brief introduction to numerical analysis".
For many years, these formulas gave rise to the fast $\mathcal{O}(n^2)$ and superfast $\mathcal{O}(n \log n)$ methods for Toeplitz matrices.
The basic idea is that you use augmentation method.
Consider that you have computed the inverse of a $(n-1) \times (n-1)$ block of the Toeplitz matrix.
You only need to store two vectors to represent the inverse.
Then, the bigger matrix can be written in the block form.
Application of $T^{-1}_{n-1}$ costs $\mathcal{O}(n \log n)$ operations, thus $x_2$ will be recovered in the same number of operations as well. The total complexity is then $\mathcal{O}(n^2 \log n)$ operations.
Superfast algorithms are obtained in terms of reducing the problem to the computations with polynomials and using block elimination (in the Fourier style).