Lecture 16: Matrix functions. Introduction to randomized linear algebra

Previous lecture

Today's lecture

Outline of this part

Book to read: *Functions of Matrices* by Nick Higham

The simplest matrix function: matrix polynomial

It is very easy to define a matrix polynomial as

$$ P(A) = \sum_{k=0}^n c_k A^k. $$

Side note: the Cayley-Hamilton theorem states that $F(A) = 0$, where $F(\lambda) = \det(A - \lambda I)$ is the characteristic polynomial; thus every matrix polynomial in $A$ can be reduced to a polynomial of degree $\leq n-1$.
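As an illustration (the coefficients and the test matrix below are arbitrary), a matrix polynomial can be evaluated with Horner's scheme:

```python
import numpy as np

def matrix_polynomial(c, A):
    """Evaluate P(A) = sum_k c[k] A^k with Horner's scheme."""
    n = A.shape[0]
    P = c[-1] * np.eye(n)
    for ck in reversed(c[:-1]):
        P = P @ A + ck * np.eye(n)
    return P

A = np.array([[2.0, 1.0], [0.0, 3.0]])
# c = [1, -2, 1] encodes I - 2A + A^2 = (A - I)^2
print(matrix_polynomial([1.0, -2.0, 1.0], A))
print((A - np.eye(2)) @ (A - np.eye(2)))  # same result
```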

Matrix polynomials as building blocks

We can define a function of a matrix by its Taylor series:

$$ f(A) = \sum_{k=0}^{\infty} c_k A^k. $$

The convergence is understood as the convergence in some matrix norm.

Example of such series is the Neumann series

$$ (I - F)^{-1} = \sum_{k=0}^{\infty} F^k, $$

which is well defined for $\rho(F) < 1$.
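A quick numerical check of the Neumann series (the matrix $F$ below is an arbitrary random matrix rescaled so that $\rho(F) < 1$):

```python
import numpy as np

np.random.seed(0)
n = 5
F = np.random.randn(n, n)
F *= 0.5 / max(abs(np.linalg.eigvals(F)))     # rescale so that rho(F) = 0.5 < 1

S = np.zeros((n, n))
term = np.eye(n)
for k in range(100):                          # partial sum of the Neumann series
    S += term
    term = term @ F

# the partial sum agrees with (I - F)^{-1} up to ~rho(F)^100
print(np.linalg.norm(S - np.linalg.inv(np.eye(n) - F)))
```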

Matrix exponential series

The most well-known matrix function is the matrix exponential. In the scalar case,

$$ e^x = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \ldots = \sum_{k=0}^{\infty} \frac{x^k}{k!}, $$

and it directly translates to the matrix case:

$$ e^A = \sum_{k=0}^{\infty} \frac{A^k}{k!}, $$

and this series converges for any $A$, since it is majorized by the convergent scalar series

$$\sum_{k=0}^{\infty} \frac{\Vert A \Vert^k}{k!} = e^{\Vert A \Vert}.$$
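A small sanity check of the series (the matrix is arbitrary; 30 terms are more than enough here since $\Vert A \Vert$ is moderate):

```python
import numpy as np
from scipy.linalg import expm

np.random.seed(0)
A = np.random.randn(4, 4)

S = np.zeros_like(A)
term = np.eye(4)
for k in range(1, 31):        # partial sums of sum_k A^k / k!
    S += term
    term = term @ A / k

print(np.linalg.norm(S - expm(A)))   # the truncated series matches scipy's expm
```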

Why the matrix exponential is important

A lot of practical problems are reduced to a system of linear ODEs of the form

$$ \frac{dy}{dt} = Ay, \quad y(0) = y_0. $$

ODE and matrix exponentials

The system $$\frac{dy}{dt} = Ay, \quad y(0) = y_0$$ has the solution $y(t) = e^{At} y_0$, since $e^{A \cdot 0} = I$ and $$\frac{d}{dt} e^{At} = \frac{d}{dt} \sum_{k=0}^{\infty} \frac{t^k A^k}{k!} = \sum_{k=1}^{\infty} \frac{t^{k-1} A^{k}}{(k-1)!} = A e^{At}.$$

Sidenote: matrix exponential and time stepping

Matrix exponential can be much better than solving using, say, Euler scheme:

$$\frac{dy}{dt} \approx \frac{y_{k+1} - y_k}{\tau} = A y_k, \quad y_{k+1} = y_k + \tau A y_k,$$

if we know how to compute the product of the matrix exponential with a vector using only matrix-by-vector products.

For dense matrices the matrix exponential also provides the exact answer to the ODE at any time $t$, whereas time-stepping schemes only give an approximation.
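An illustrative comparison (matrix, time horizon and step count are arbitrary) of the exact solution $e^{At}y_0$ with the explicit Euler scheme:

```python
import numpy as np
from scipy.linalg import expm

np.random.seed(0)
n, t = 50, 1.0
A = -np.random.rand(n, n)          # some test matrix
y0 = np.random.randn(n)

y_exact = expm(t * A) @ y0         # exact solution of dy/dt = Ay, y(0) = y0

m = 1000                           # explicit Euler with m steps of size tau
tau = t / m
y = y0.copy()
for _ in range(m):
    y = y + tau * (A @ y)

# first-order O(tau) error of the Euler scheme
print(np.linalg.norm(y - y_exact) / np.linalg.norm(y_exact))
```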

How to compute matrix functions, including the exponential?

If $A$ is diagonalizable,

$$ A = S \Lambda S^{-1}, $$

where the columns of $S$ are eigenvectors of the matrix $A$, then

$$ F(A) = S F(\Lambda) S^{-1}. $$

Problem: diagonalization can be unstable! (and not every matrix is diagonalizable)

Let us look at how well such matrices are diagonalized, taking a slightly perturbed Jordan block as an example.

Now we can compute a function of the perturbed Jordan block.
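A minimal illustration of this instability (the block size and the perturbation $\varepsilon$ are chosen for demonstration): for a slightly perturbed Jordan block the eigenvector matrix is extremely ill-conditioned, and the exponential computed via diagonalization loses many digits compared to `scipy.linalg.expm`.

```python
import numpy as np
from scipy.linalg import expm

n, lam, eps = 10, 1.0, 1e-12
J = lam * np.eye(n) + np.diag(np.ones(n - 1), 1)   # Jordan block
J[-1, 0] = eps                                     # tiny perturbation makes it diagonalizable

w, S = np.linalg.eig(J)
F_eig = (S * np.exp(w)) @ np.linalg.inv(S)         # F(A) = S F(Lambda) S^{-1}

print(np.linalg.cond(S))                           # huge: eigenvectors are nearly parallel
print(np.linalg.norm(F_eig - expm(J)) / np.linalg.norm(expm(J)))   # noticeable error
```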

How the funm function works

Schur-Parlett algorithm

It is based on the Schur decomposition

$$ A = U T U^*, $$

with unitary $U$ and upper triangular $T$, so that $F(A) = U F(T) U^*$.

Computing functions of triangular matrices

We know the values on the diagonal,

$$ F_{ii} = F(T_{ii}), $$

and we also know that

$$ F T = T F, $$

i.e. the matrix function commutes with the matrix itself. The function of a triangular matrix is again triangular. Using the known diagonal values and the commutativity property, we obtain the superdiagonals of the matrix one by one:

$$f_{ij} = t_{ij} \frac{f_{ii} - f_{jj}}{t_{ii} - t_{jj}} + \sum_{k=i+1}^{j-1} \frac{f_{ik} t_{kj} - t_{ik}f_{kj}}{t_{ii} - t_{jj}}.$$
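A sketch of this recurrence for an upper triangular matrix with distinct diagonal entries (the full Schur-Parlett algorithm additionally groups close eigenvalues into blocks to avoid division by small $t_{ii} - t_{jj}$):

```python
import numpy as np
from scipy.linalg import expm

def parlett(T, f):
    """f(T) for an upper triangular T with distinct diagonal entries."""
    n = T.shape[0]
    F = np.diag(f(np.diag(T)))                     # F_ii = f(T_ii)
    for d in range(1, n):                          # fill superdiagonals one by one
        for i in range(n - d):
            j = i + d
            s = T[i, j] * (F[i, i] - F[j, j])
            s += F[i, i + 1:j] @ T[i + 1:j, j] - T[i, i + 1:j] @ F[i + 1:j, j]
            F[i, j] = s / (T[i, i] - T[j, j])
    return F

np.random.seed(0)
T = np.triu(np.random.rand(5, 5), 1) + np.diag(np.arange(1.0, 6.0))
print(np.linalg.norm(parlett(T, np.exp) - expm(T)))   # small
```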

Matrix functions: definition

$$ f(A) = \frac{1}{2\pi i}\int_{\Gamma} f(z) (zI - A)^{-1} dz, $$

where $f(z)$ is analytic on and inside a closed contour $\Gamma$ that encloses the spectrum of $A$.
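This definition can even be used numerically: a sketch with the trapezoidal rule on a circle enclosing the spectrum (matrix, contour and number of quadrature nodes below are illustrative).

```python
import numpy as np
from scipy.linalg import expm

np.random.seed(0)
n = 6
A = np.diag(np.random.rand(n)) + 0.1 * np.random.randn(n, n)   # spectrum roughly inside (0, 1)

c, r, N = 0.5, 2.0, 64          # contour z = c + r e^{i theta} enclosing the spectrum
F = np.zeros((n, n), dtype=complex)
for z in c + r * np.exp(2j * np.pi * np.arange(N) / N):
    # dz = i (z - c) dtheta, which cancels the 1/(2 pi i) prefactor up to 1/N
    F += np.exp(z) * np.linalg.inv(z * np.eye(n) - A) * (z - c)
F /= N

print(np.linalg.norm(F.real - expm(A)))   # trapezoidal rule is exponentially accurate in N
```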

Important matrix functions

Matrix exponential

$$e^A = I + A + \frac{1}{2} A^2 + \frac{1}{3!} A^3 + \ldots$$

Series convergence

Method 1: Krylov method

$$ f(A)v \approx f(Q H Q^*)v = Q f(H) Q^*v,$$

where $Q$ is the orthonormal basis of the Krylov subspace built by the Arnoldi process and $H = Q^* A Q$ is a small upper Hessenberg matrix, for which we can use, for example, the Schur-Parlett algorithm.
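A sketch of this idea for $f = \exp$ (Arnoldi process with an illustrative test matrix; for simplicity the small $f(H)$ is evaluated with scipy's `expm` rather than Schur-Parlett):

```python
import numpy as np
from scipy.linalg import expm

def krylov_expm(A, v, m=30):
    """Approximate exp(A) @ v from an m-dimensional Krylov subspace (Arnoldi)."""
    n = v.shape[0]
    Q = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    beta = np.linalg.norm(v)
    Q[:, 0] = v / beta
    for j in range(m):
        w = A @ Q[:, j]
        for i in range(j + 1):                   # modified Gram-Schmidt
            H[i, j] = Q[:, i] @ w
            w = w - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        Q[:, j + 1] = w / H[j + 1, j]            # no breakdown handling, for brevity
    return beta * Q[:, :m] @ expm(H[:m, :m])[:, 0]   # f(A) v ~ beta Q_m f(H_m) e_1

np.random.seed(0)
n = 200
A = -np.diag(10 * np.random.rand(n)) + 0.01 * np.random.randn(n, n)  # spectrum roughly in [-10, 0]
v = np.random.randn(n)
exact = expm(A) @ v
print(np.linalg.norm(krylov_expm(A, v) - exact) / np.linalg.norm(exact))
```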

Padé approximations

$$ \exp(x) \approx \frac{p(x)}{q(x)}, $$

where $p(x)$ and $q(x)$ are polynomials, and the computation of a rational function of a matrix reduces to matrix-matrix products and matrix inversions.

Scaling & squaring algorithm

The "canonical algorithm" for the computation of the matrix exponential also relies on scaling of the matrix $A:$

$$\exp(A) = \exp(A/2^k)^{(2^k)}.$$

The matrix then can have a small norm, thus:
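A minimal sketch of scaling and squaring, with a truncated Taylor series standing in for the Padé approximant used by production codes such as `scipy.linalg.expm`:

```python
import numpy as np
from scipy.linalg import expm

def expm_scaling_squaring(A, order=16):
    """exp(A) via scaling and squaring; truncated Taylor series on the scaled matrix."""
    k = max(0, int(np.ceil(np.log2(max(np.linalg.norm(A, 1), 1e-16)))))
    B = A / 2**k                                 # now ||B||_1 <= 1
    E = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for j in range(1, order + 1):                # exp(B) ~ sum_j B^j / j!
        term = term @ B / j
        E = E + term
    for _ in range(k):                           # undo the scaling: square k times
        E = E @ E
    return E

np.random.seed(0)
A = np.random.randn(20, 20)
print(np.linalg.norm(expm_scaling_squaring(A) - expm(A)) / np.linalg.norm(expm(A)))
```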

Large-scale matrix exponentials

Rational Krylov subspaces

The simplest (yet efficient) approach is based on the so-called extended Krylov subspaces:

$$KE(A, b) = \mathrm{Span}(\ldots, A^{-2} b, A^{-1} b, b, A b, A^2 b, \ldots)$$

At each step we add vectors of the form $A w$ and $A^{-1} w$ to the subspace and orthogonalize the result (rational Arnoldi method).

That is, each step requires only one linear system solve, and since the matrix $A$ is fixed, we can factorize it once and reuse the factorization.

Rational Krylov methods

Rational Krylov methods are among the most efficient approaches for computing functions of large matrices:

$$KE(A, b) = \mathrm{Span}(\ldots, A^{-2} b, A^{-1} b, b, A b, A^2 b, \ldots)$$ $$f(A)b \approx Q f(H) Q^*b,$$

where $H = Q^* A Q.$

Each step requires one linear system solve and one matrix-by-vector product.
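A rough sketch of an extended Krylov approximation of $A^{-1/2}b$ for a 1D Laplacian (matrix, function and number of steps are illustrative): the matrix $A$ is factorized once with `scipy.sparse.linalg.splu` and the factorization is reused for every $A^{-1}w$.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def inv_sqrt(M):
    """M^{-1/2} for a symmetric positive definite dense matrix."""
    w, U = np.linalg.eigh((M + M.T) / 2)
    return (U / np.sqrt(w)) @ U.T

def extended_krylov(A, b, f, steps=30):
    """Approximate f(A) b from Span(b, A b, A^{-1} b, A^2 b, A^{-2} b, ...)."""
    lu = splu(A)                                 # factorize A once, reuse at every step
    V = [b / np.linalg.norm(b)]
    for _ in range(steps):
        for w in (A @ V[-1], lu.solve(V[-1])):   # add A w and A^{-1} w
            for _gs in range(2):                 # two Gram-Schmidt passes for stability
                for v in V:
                    w = w - (v @ w) * v
            nrm = np.linalg.norm(w)
            if nrm > 1e-12:
                V.append(w / nrm)
    Q = np.array(V).T                            # orthonormal basis of the extended subspace
    H = Q.T @ (A @ Q)                            # projected matrix H = Q^* A Q
    return Q @ (f(H) @ (Q.T @ b))

np.random.seed(0)
n = 256
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")  # 1D Laplacian
b = np.random.randn(n)
x = extended_krylov(A, b, inv_sqrt)              # approximates A^{-1/2} b
x_exact = inv_sqrt(A.toarray()) @ b
print(np.linalg.norm(x - x_exact) / np.linalg.norm(x_exact))
```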

Inverse square root of the matrix

The inverse square root $A^{-1/2}$ arises, for example, in whitening a Gaussian vector: if $x$ has density

$$ p(x) \propto e^{-\frac{1}{2}(A^{-1}x,\, x)}, $$

i.e. covariance matrix $A$, then

$$ y = A^{-\frac{1}{2}} x $$

is a standard normal vector.

Application to compute distance between manifolds

$$\mathrm{hkt}_{\mathcal{M}}(t) = \mathrm{trace}(\exp(-t L_{\mathcal{M}}))$$

contains all information about the spectrum of the graph Laplacian $L_{\mathcal{M}}$

$$d_{GW}(\mathcal{M}, \mathcal{N}) \geq \sup_{t > 0} \exp(-2(t + t^{-1}))|\mathrm{hkt}_{\mathcal{M}}(t) - \mathrm{hkt}_{\mathcal{N}}(t)|$$

Stochastic trace estimator

$$ \mathrm{trace}(A) = \mathbb{E}_{p(x)}(x^{\top}Ax), $$

where $p(x)$ is a distribution with zero mean and identity covariance, e.g. the Rademacher or standard normal distribution (Hutchinson's estimator).
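A quick sketch of this estimator with Rademacher probes (matrix and number of probes are illustrative):

```python
import numpy as np

np.random.seed(0)
n = 1000
A = np.random.randn(n, n)
A = A @ A.T / n                                    # some symmetric test matrix

m = 200                                            # number of random probes
X = np.random.choice([-1.0, 1.0], size=(n, m))     # Rademacher vectors
trace_est = np.mean(np.einsum('im,im->m', X, A @ X))   # average of x^T A x

print(trace_est, np.trace(A))                      # close, up to O(1/sqrt(m)) noise
```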

Distances between languages (original paper)

Where do stochastic methods also help?

Randomized SVD (Halko et al, 2011)

$$ A \approx U\Sigma V^\top, $$

where $A$ is of size $m \times n$, $U$ is of size $m \times k$ and $V$ is of size $n \times k$.

$$A \approx Q Q^{\top}A, $$

where $Q$ is a randomized approximation to an orthonormal basis of the column space of $A$.
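A minimal version of the basic scheme (sizes, target rank and oversampling below are illustrative): sketch $A$ with a Gaussian matrix, orthogonalize, and compute a small SVD of $Q^\top A$.

```python
import numpy as np

def randomized_svd(A, k, p=10):
    """Rank-k SVD of A from a random Gaussian sketch with oversampling p."""
    m, n = A.shape
    G = np.random.randn(n, k + p)          # random test matrix
    Q, _ = np.linalg.qr(A @ G)             # orthonormal basis for range(A G)
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

np.random.seed(0)
m, n, k = 2000, 500, 10
A = np.random.randn(m, k) @ np.random.randn(k, n) + 1e-6 * np.random.randn(m, n)
U, s, Vt = randomized_svd(A, k)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))   # at the noise level
```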

Convergence theorem

The expected error of the presented algorithm, where $k$ is the target rank and $p$ is the oversampling parameter, is bounded as follows:

$$ \mathbb{E}\|A - QQ^{\top}A \|_F \leq \left( 1 + \frac{k}{p-1} \right)^{1/2}\left( \sum_{j=k+1}^{\min(m, n)} \sigma^2_j \right)^{1/2} $$ $$ \mathbb{E}\|A - QQ^{\top}A \|_2 \leq \left( 1 + \sqrt{\frac{k}{p-1}} \right)\sigma_{k+1} + \frac{e\sqrt{k+p}}{p}\left( \sum_{j=k+1}^{\min(m, n)} \sigma^2_j \right)^{1/2} $$

The expectation is taken w.r.t. the random matrix $G$ generated in the method described above.

Compare these upper bounds with the Eckart-Young theorem. Are these bounds good?

Accuracy enhanced randomized SVD

$$ Y = (AA^{\top})^qAG \qquad Q, R = \mathtt{qr}(Y) $$

Loss of accuracy with rounding errors

Q: how can we combat this issue?

A: sequential orthogonalization!
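One common form of this remedy is to insert a QR re-orthogonalization after every multiplication by $A$ or $A^\top$; a sketch (sizes, target rank and the test spectrum are illustrative) is below. Without the intermediate QR, the columns of $Y$ would all collapse onto the dominant singular vectors in floating point.

```python
import numpy as np

def randomized_range_power(A, k, p=10, q=2):
    """Orthonormal Q approximating the range of A, with q power iterations."""
    m, n = A.shape
    Q, _ = np.linalg.qr(A @ np.random.randn(n, k + p))
    for _ in range(q):
        W, _ = np.linalg.qr(A.T @ Q)       # re-orthogonalize after each product,
        Q, _ = np.linalg.qr(A @ W)         # computing (A A^T)^q A G stably
    return Q

np.random.seed(0)
m, n, k = 1000, 800, 20
U, _ = np.linalg.qr(np.random.randn(m, n))
V, _ = np.linalg.qr(np.random.randn(n, n))
s = 1.0 / np.arange(1, n + 1)              # slowly decaying singular values
A = (U * s) @ V.T

for q in (0, 1, 2):
    Q = randomized_range_power(A, k, q=q)
    # ratio of the achieved error to the optimal sigma_{k+1} (Eckart-Young)
    print(q, np.linalg.norm(A - Q @ (Q.T @ A), 2) / s[k])
```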

Convergence theorem

The method presented above admits the following upper bound:

$$ \mathbb{E}\|A - QQ^{\top}A \|_2 \leq \left[\left( 1 + \sqrt{\frac{k}{p-1}} \right)\sigma^{2q+1}_{k+1} + \frac{e\sqrt{k+p}}{p}\left( \sum_{j=k+1}^{\min(m, n)} \sigma^{2(2q+1)}_j \right)^{1/2}\right]^{1/(2q+1)} $$

Consider the worst case, where the given matrix has no low-rank structure.

Q: what is the degree of suboptimality w.r.t. the Eckart-Young theorem?

Summary on randomized SVD

Kaczmarz method to solve linear systems

At each iteration, pick a single equation of the system,

$$a^{\top}_i x = f_i,$$

and given an approximation $x_k$ find $x_{k+1}$ as the projection of $x_k$ onto the corresponding hyperplane:

$$x_{k+1} = \arg \min_x \frac12 \Vert x - x_k \Vert^2_2, \quad \mbox{s.t.} \quad a^{\top}_i x = f_i,$$

which gives the explicit update

$$x_{k+1} = x_k - \frac{(a_i, x_k) - f_i}{(a_i, a_i)} a_i. $$
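A sketch of the randomized Kaczmarz iteration on a consistent system, with rows sampled proportionally to $\|a_i\|_2^2$ as in Strohmer and Vershynin (sizes and iteration count are illustrative):

```python
import numpy as np

np.random.seed(0)
m, n = 500, 100
A = np.random.randn(m, n)
x_true = np.random.randn(n)
f = A @ x_true                                            # consistent overdetermined system

p = np.sum(A**2, axis=1) / np.linalg.norm(A, 'fro')**2    # row sampling probabilities
x = np.zeros(n)
for _ in range(10000):
    i = np.random.choice(m, p=p)
    a = A[i]
    x = x - (a @ x - f[i]) / (a @ a) * a                  # project onto {x : a_i^T x = f_i}

print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```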

Convergence theorem

$$ \mathbb{E}[\|x_{k+1} - x^*\|^2_2] \leq \left(1 - \frac{1}{\kappa^2_F(A)}\right) \mathbb{E}[\|x_{k} - x^*\|^2_2], $$

where $\kappa_F(A) = \frac{\| A \|_F}{\sigma_{\min}(A)}$ and $\sigma_{\min}(A)$ is the minimal non-zero singular value of $A$. This result was presented in (Strohmer and Vershynin, 2009), where the row index $i$ is sampled with probability proportional to $\|a_i\|_2^2$.

For a noisy (inconsistent) system the bound becomes

$$ \mathbb{E}[\|x_{k+1} - x^*\|^2_2] \leq \left(1 - \frac{1}{\kappa^2_F(A)}\right) \mathbb{E}[\|x_{k} - x^*\|^2_2] + \frac{\|r^*\|_2^2}{\| A \|^2_F}, $$

where $r^* = Ax^* - f$ is the residual at the solution $x^*$, which is non-zero for an inconsistent system.

Inconsistent overdetermined linear system

Here $a_{:, j}$ denotes the $j$-th column of $A$ and $a_{i, :}$ denotes the $i$-th row of $A$

Sampling and sketching

Summary on randomized methods in solving linear systems

Randomized matrix multiplication

Q: how do we sample the column-row pairs in $AB = \sum_{i=1}^{n} A^{(i)} B_{(i)}$?

A: generate the probabilities from their norms!

$$ AB \approx \sum_{t=1}^k \frac{1}{kp_{i_t}} A^{(i_t)} B_{(i_t)}, $$

where $A^{(i_t)}$ is a column of $A$, $B_{(i_t)}$ is a row of $B$, and $p_{i_t}$ is the probability of sampling index $i_t$.
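A sketch of this sampling scheme with norm-based probabilities (sizes, the number of sampled pairs, and the decaying column scaling are illustrative; the decay is what makes a small sample informative):

```python
import numpy as np

np.random.seed(0)
m, n, l = 300, 1000, 200
w = 1.0 / np.arange(1, n + 1)                    # strongly non-uniform column/row norms
A = np.random.randn(m, n) * w
B = np.random.randn(n, l) * w[:, None]

p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)   # ||A^{(i)}||_2 * ||B_{(i)}||_2
p /= p.sum()

k = 100                                          # number of sampled column-row pairs
idx = np.random.choice(n, size=k, p=p)
C = A[:, idx] / np.sqrt(k * p[idx])              # rescaled sampled columns of A
R = B[idx, :] / np.sqrt(k * p[idx])[:, None]     # rescaled sampled rows of B

print(np.linalg.norm(A @ B - C @ R) / np.linalg.norm(A @ B))   # modest error from few pairs
```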

Approximation error

$$ \mathbb{E} [\|AB - CR\|^2_F] = \frac{1}{k} \left(\sum_{i=1}^n \| A^{(i)} \|_2 \| B_{(i)} \|_2\right)^2 - \frac{1}{k}\|AB\|_F^2, $$

which is attained for the norm-based probabilities $p_i = \dfrac{\|A^{(i)}\|_2 \|B_{(i)}\|_2}{\sum_{j=1}^n \|A^{(j)}\|_2 \|B_{(j)}\|_2}$.

Q: what matrices can be used?

Summary on randomized matmul

Take home message

Plan for the next class