Lecture 15: Great Iterative Methods

Previous lecture

Plan for this part of class

Solution of linear systems and minimization of functionals

$$R(x) = \Vert A x - f \Vert^2_2.$$ $$A^* A x = A^* f,$$

thus it has squared condition number, so direct minimization of the residual by standard optimization methods is rarely used.

Energy functional

Let $A = A^* > 0$, then the following functional

$$\Phi(x) = (Ax, x) - 2(f, x)$$

is called energy functional.

Properties of energy functional

$$ \Phi(\alpha x + (1 - \alpha)y) < \alpha \Phi(x) + (1 - \alpha) \Phi(y)$$ $$A x_* = f.$$

Indeed,

$$\nabla \Phi = 2(Ax - f).$$

and the first order optimality condition $\nabla \Phi (x_*) = 0$ yields

$$A x_* = f.$$

Approximation of the solution by a subspace

$$A x \approx f, \quad x = x_0 + \sum_{k=1}^M c_k y_k,$$

where $c$ is the vector of coefficients.

$$(Ax, x) - 2(f, x)$$

subject to $$x = x_0 + Y c,$$

where $Y=[y_1,\dots,y_M]$ is $n \times M$ and vector $c$ has length $M$.

$$\widehat{\Phi}(c) = (A Y c, Y c) + 2 (Y^*Ax_0, c) - 2(f, Y c) = (Y^* A Y c, c) - 2(Y^* (f - Ax_0), c).$$ $$Y^* A Y c = Y^* (f - Ax_0) = Y^* r_0,$$

which is an $M \times M$ linear system with symmetric positive definite matrix if $Y$ has full column rank.

But how to choose $Y$?

Selection of the subspace: random projection

Selection of the subspace: Krylov subspace

In the Krylov subspace we generate the whole subspace from a single vector $r_0 = f - Ax_0$:

$$y_0\equiv k_0 = r_0, \quad y_1\equiv k_1 = A r_0, \quad y_2\equiv k_2 = A^2 r_0, \ldots, \quad y_{M-1}\equiv k_{M-1} = A^{M-1} r_0.$$

This gives the Krylov subpace of the $M$-th order

$$\mathcal{K}_M(A, r_0) = \mathrm{Span}(r_0, Ar_0, \ldots, A^{M-1} r_0).$$

Solution $x_*$ lies in the Krylov subspace: $x_* \in \mathcal{K}_n(A, f)$

Ill-conditioned of the natural basis

The natural basis in the Krylov subspace is very ill-conditioned, since

$$k_i = A^i r_0 \rightarrow \lambda_\max^i v,$$

where $v$ is the eigenvector, corresponding to the maximal eigenvalue of $A$, i.e. $k_i$ become more and more collinear for large $i$.

Solution: Compute orthogonal basis in the Krylov subspace.

Good basis in a Krylov subspace

In order to have stability, we first orthogonalize the vectors from the Krylov subspace using Gram-Schmidt orthogonalization process (or, QR-factorization).

$$K_j = \begin{bmatrix} r_0 & Ar_0 & A^2 r_0 & \ldots & A^{j-1} r_0\end{bmatrix} = Q_j R_j, $$

and the solution will be approximated as

$$x \approx x_0 + Q_j c.$$

Short way to Arnoldi relation

Statement. The Krylov matrix $K_j$ satisfies an important recurrent relation (called Arnoldi relation)

$$A Q_j = Q_j H_j + h_{j, j-1} q_j e^{\top}_{j-1},$$

where $H_j$ is upper Hessenberg, and $Q_{j+1} = [q_0,\dots,q_j]$ has orthogonal columns that spans columns of $K_{j+1}$.

Let us prove it (consider $j = 3$ for simplicity):

$$A \begin{bmatrix} k_0 & k_1 & k_2 \end{bmatrix} = \begin{bmatrix} k_1 & k_2 & k_3 \end{bmatrix} = \begin{bmatrix} k_0 & k_1 & k_2 \end{bmatrix} \begin{bmatrix} 0 & 0 & \alpha_0 \\ 1 & 0 & \alpha_1 \\ 0 & 1 & \alpha_2 \\ \end{bmatrix} + \begin{bmatrix} 0 & 0 & k_3 - \alpha_0 k_0 - \alpha_1 k_1 - \alpha_2 k_2 \end{bmatrix}, $$

where $\alpha_s$ will be selected later. Denote $\widehat{k}_3 = k_3 - \alpha_0 k_0 - \alpha_1 k_1 - \alpha_2 k_2$.

In the matrix form,

$$A K_3 = K_3 Z + \widehat k_3 e^{\top}_2,$$

where $Z$ is the lower shift matrix with the last column $(\alpha_0,\alpha_1,\alpha_2)^T$, and $e_2$ is the last column of the identity matrix.

Let

$$K_3 = Q_3 R_3$$

be the QR-factorization. Then,

$$A Q_3 R_3 = Q_3 R_3 Z + \widehat{k}_3 e^{\top}_2,$$

$$ A Q_3 = Q_3 R_3 Z R_3^{-1} + \widehat{k}_3 e^{\top}_2 R_3^{-1}.$$

Note that

$$e^{\top}_2 R_3^{-1} = \begin{bmatrix} 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & 0 & * \end{bmatrix} = \gamma e^{\top}_2,$$

and

$$R_3 Z R_3^{-1} = \begin{bmatrix} * & * & * \\* & * & * \\ 0 & * & * \\ \end{bmatrix},$$

in the general case it will be an upper Hessenberg matrix $H$, i.e. a matrix that

$$H_{ij} = 0, \quad \mbox{if } i > j + 1.$$

(Almost) Arnoldi relation

Let $Q_j$ be the orthogonal basis in the Krylov subspace, then we have almost the Arnoldi relation

$$A Q_j = Q_j H_j + \gamma\widehat{k}_j e^{\top}_{j-1},$$

where $H_j$ is an upper Hessenberg matrix, and

$$\widehat{k}_j = k_j - \sum_{s=0}^{j-1} \alpha_s k_s.$$

We select $\alpha_s$ in such a way that

$$Q^*_j \widehat{k}_j = 0.$$

Then, $\widehat{k}_j = h_{j, j-1} q_j,$ where $q_j$ is the last column of $Q_{j+1}$.

Arnoldi relation: final formula

We have

$$A Q_j = Q_j H_j + h_{j, j-1} q_j e^{\top}_{j-1}.$$

Lanczos process

If $A = A^*$, then

$$Q^*_j A Q_j = H_j, $$

thus $H_j$ is hermitian, and thus it is tridiagonal, $H_j = T_j$.

This gives a short-term recurrence relation to generate the Arnoldi vectors $q_j$ without full orthogonalization.

Lanczos process (2)

$$ A Q_j = Q_j T_j + t_{j, j-1} q_j e^{\top}_{j-1}.$$

In order to get $q_j$, we need to compute just the last column of

$$t_{j, j-1} q_j = (A Q_j - Q_j T_j) e_{j-1} = A q_{j-1} - t_{j-1, j-1} q_{j-1} - t_{j-2, j-1} q_{j-2}. $$

The coefficients $\alpha_j = t_{j-1, j-1}$ and $\beta_j = t_{j-2, j-1}$ can be recovered from orthogonality constraints

$(q_j, q_{j-1}) = 0, \quad (q_j, q_{j-2}) = 0$

All the other constraints will be satisfied automatically!!

And we only need to store two vectors to get the new one.

From direct Lanczos method to the conjugate gradient

We can now get from the Lanczos recurrence to the famous conjugate gradient method.

We have for $A = A^* > 0$

$$A Q_j = Q_j T_j + T_{j, j-1} q_j.$$

Recall that when we minimize energy functional in basis $Y$ we get a system $Y^* A Y c = Y^* f,$. Here $Y = Q_j$, so the approximate solution of $Ax \approx f$ with $x_j = x_0 + Q_j c_j$ can be found by solving a small system

$$Q^*_j A Q_j c_j = T_j c_j = Q^*_j r_0 .$$

Since $f$ is the first Krylov subspace, then Note!!! (recall what the first column in $Q_j$ is)

$$Q^*_j r_0 = \Vert r_0 \Vert_2 e_0 = \gamma e_0.$$

We have a tridiagonal system of equations for $c$:

$$T_j c_j = \gamma e_0$$

and $x_j = Q_j c_j$.

We could stop at this point, but we want short recurrent formulas instead of solving linear system with matrix $T_j$ at each step.

Derivation of the following update formulas is not required on the oral exam!

$$ T_j = \begin{bmatrix} a_1 & b_1 & & \\ b_1 & a_2 & b_2 & \\ & \ddots & \ddots & \ddots & \\ & & b_{j-1} & a_{j-1} & b_j \\ & & & b_j & a_j \end{bmatrix} = \begin{bmatrix} 1 & & & \\ c_1 & 1 & & \\ & \ddots & \ddots & & \\ & & c_{j-1} & 1 & \\ & & & c_j & 1 \end{bmatrix} \begin{bmatrix} d_1 & b_1 & & \\ & d_1 & b_2 & \\ & & \ddots & \ddots & \\ & & & d_{j-1} & b_j \\ & & & & d_j \end{bmatrix} $$

$$c_i = b_i/d_{i-1}, \quad d_i = \begin{cases} a_1, & \mbox{if } i = 1, \\ a_i - c_i b_i, & \mbox{if } i > 1. \end{cases}$$ $$x_j = Q_j T^{-1}_j \gamma e_0 = \gamma Q_j (L_j U_j)^{-1} e_0 = \gamma Q_j U^{-1}_j L^{-1}_j e_0.$$ $$P_j = Q_j U^{-1}_j, \quad z_j = \gamma L^{-1}_j e_0.$$ $$ x_j = P_j z_j$$ $$P_j = \begin{bmatrix} P_{j-1} & p_j \end{bmatrix}, $$

and

$$z_j = \begin{bmatrix} z_{j-1} \\ \xi_{j} \end{bmatrix}.$$ $$p_j = \frac{1}{d_j}\left(q_j - b_j p_{j-1} \right), \quad \xi_j = -c_j \xi_{j-1}.$$ $$x_j = P_j z_j = P_{j-1} z_{j-1} + \xi_j p_j = x_{j-1} + \xi_j p_j.$$

and $q_j$ are found from the Lanczos relation (see slides above).

Direct Lanczos method

We have the direct Lanczos method, where we store

$$p_{j-1}, q_j, x_{j-1}$$

to get a new estimate of $x_j$.

The main problem is with $q_j$: we have the three-term recurrence, but in the floating point arithmetic the orthogonality is can be lost, leading to numerical errors.

Let us do some demo.

Conjugate gradient method

Instead of $q_j$ (last vector in the modified Gram-Schmidt process), it is more convenient to work with the residual

$$r_j = f - A x_j.$$

The resulting recurrency has the form

$x_j = x_{j-1} + \alpha_{j-1} p_{j-1}$

$r_j = r_{j-1} - \alpha_{j-1} A p_{j-1}$

$p_j = r_j + \beta_j p_{j-1}$.

Hence the name conjugate gradient: to the gradient $r_j$ we add a conjugate direction $p_j$.

We have orthogonality of residuals (check!):

$$(r_i, r_j) = 0, \quad i \ne j$$

and A-orthogonality of conjugate directions (check!):

$$ (A p_i, p_j) = 0,$$

which can be checked from the definition.

The equations for $\alpha_j$ and $\beta_j$ can be now defined explicitly from these two properties.

CG final formulas

We have $(r_{j}, r_{j-1}) = 0 = (r_{j-1} - \alpha_{j-1} A p_{j-1}, r_{j-1})$,

thus

$$\alpha_{j-1} = \frac{(r_{j-1}, r_{j-1})}{(A p_{j-1}, r_{j-1})} = \frac{(r_{j-1}, r_{j-1})}{(A p_{j-1}, p_{j-1} - \beta_{j-1}p_{j-2})} = \frac{(r_{j-1}, r_{j-1})}{(A p_{j-1}, p_{j-1})}.$$

In the similar way, we have

$$\beta_{j-1} = \frac{(r_j, r_j)}{(r_{j-1}, r_{j-1})}.$$

Recall that

$x_j = x_{j-1} + \alpha_{j-1} p_{j-1}$

$r_j = r_{j-1} - \alpha_{j-1} A p_{j-1}$

$p_j = r_j + \beta_j p_{j-1}$.

Only one matrix-by-vector product per iteration.

CG derivation overview

Some history

More details here: https://www.siam.org/meetings/la09/talks/oleary.pdf

When Hestenes worked on conjugate bases in 1936, he was advised by a Harvard professor that it was too obvious for publication

Properties of the CG method

$A$-optimality

Energy functional can be written as

$$(Ax, x) - 2(f, x) = (A (x - x_*), (x - x_*)) - (Ax _*, x_*),$$

where $A x_* = f$. Up to a constant factor,

$$ (A(x - x_*), (x -x_*)) = \Vert x - x_* \Vert^2_A$$

is the A-norm of the error.

Convergence

The CG method computes $x_k$ that minimizes the energy functional over the Krylov subspace, i.e. $x_k = p(A)f$, where $p$ is a polynomial of degree $k+1$, so

$$\Vert x_k - x_* \Vert_A = \inf\limits_{p} \Vert \left(p(A) - A^{-1}\right) f \Vert_A. $$

Using eigendecomposition of $A$ we have

$$A = U \Lambda U^*, \quad g = U^* f,$$

and

$\Vert x - x_* \Vert^2_A = \displaystyle{\inf_p} \Vert \left(p(\Lambda) - \Lambda^{-1}\right) g \Vert_\Lambda^2 = \displaystyle{\inf_p} \displaystyle{\sum_{i=1}^n} \frac{(\lambda_i p(\lambda_i) - 1)^2 g^2_i}{\lambda_i} = \displaystyle{\inf_{q, q(0) = 1}} \displaystyle{\sum_{i=1}^n} \frac{q(\lambda_i)^2 g^2_i}{\lambda_i} $

Selection of the optimal $q$ depends on the eigenvalue distribution.

Absolute and relative error

We have

$$\Vert x - x_* \Vert^2_A \leq \sum_{i=1}^n \frac{g^2_i}{\lambda_i} \inf_{q, q(0)=1} \max_{j} q({\lambda_j})^2$$

The first term is just

$$\sum_{i=1}^n \frac{g^2_i}{\lambda_i} = (A^{-1} f, f) = \Vert x_* \Vert^2_A.$$

And we have relative error bound

$$\frac{\Vert x - x_* \Vert_A }{\Vert x_* \Vert_A} \leq \inf_{q, q(0)=1} \max_{j} |q({\lambda_j})|,$$

so if matrix has only 2 different eigenvalues, then there exists a polynomial of degree 2 such that $q({\lambda_1}) =q({\lambda_2})=0$, so in this case CG converges in 2 iterations.

  • If eigenvalues are clustered and there are $l$ outliers, then after first $\mathcal{O}(l)$ iterations CG will converge as if there are no outliers (and hence the effective condition number is smaller).

  • The intuition behind this fact is that after $\mathcal{O}(l)$ iterations the polynomial has degree more than $l$ and thus is able to zero $l$ outliers. </font>

Let us find another useful upper-bound estimate of convergence. Since

$$ \inf_{q, q(0)=1} \max_{j} |q({\lambda_j})| \leq \inf_{q, q(0)=1} \max_{\lambda\in[\lambda_\min,\lambda_\max]} |q({\lambda})| $$

The last term is just the same as for the Chebyshev acceleration, thus the same upper convergence bound holds:

$$\frac{\Vert x_k - x_* \Vert_A }{\Vert x_* \Vert_A} \leq \gamma \left( \frac{\sqrt{\mathrm{cond}(A)}-1}{\sqrt{\mathrm{cond}(A)}+1}\right)^k.$$