Lecture 16: Iterative methods and preconditioners

Previous part

Now we will cover the following topics

What methods to use

More detailed flowchart from this book

MINRES

The MINRES method is GMRES applied to a symmetric system. We look for an approximation $x_j = Q_j \widehat{x}_j$ in the Lanczos basis $Q_j$ and, using the relation $A Q_j = Q_{j+1} \widehat{H}_{j+1}$, minimize

$$\Vert A Q_j \widehat{x}_j - f \Vert_2 = \Vert Q_{j+1} \widehat{H}_{j+1} \widehat{x}_j - f \Vert_2 \rightarrow \min,$$

which is equivalent to a linear least squares problem with an almost tridiagonal matrix

$$\Vert \widehat{H}_{j+1} \widehat{x}_{j} - \gamma e_0 \Vert_2 \rightarrow \min, \qquad \gamma = \Vert f \Vert_2 \ \text{(for a zero initial guess)}.$$

Difference between MINRES and CG

CG stores fewer vectors ($3$ instead of $5$).
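
As an illustration, here is a possible small demo with scipy.sparse.linalg.minres on a symmetric indefinite matrix (a made-up shifted 1D Laplacian), where CG is not applicable:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # Shifted 1D Laplacian: symmetric but indefinite, so CG does not apply.
    n = 100
    lap = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
    A = (lap - 2.0 * sp.eye(n)).tocsr()
    f = np.ones(n)

    x, info = spla.minres(A, f)
    print(info, np.linalg.norm(A @ x - f))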

Now, let us talk about non-symmetric systems.

Non-symmetric systems

For non-symmetric systems GMRES has to store the full Krylov basis, so in practice it is restarted: after a fixed number of steps we write the correction equation

$$A(x + e) = f, \quad Ae = f - Ax,$$

and generate a new Krylov subspace from the residual vector. This spoils the convergence, as we will see from the demo.
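
A possible sketch of such a demo (the matrix is made up for illustration): compare the number of inner iterations of full GMRES and a heavily restarted one.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # A simple non-symmetric (and diagonally dominant) test matrix.
    n = 300
    A = sp.diags([-1.0, 2.5, -1.2], [-1, 0, 1], shape=(n, n), format="csr")
    f = np.ones(n)

    def gmres_history(restart):
        # Record the preconditioned residual norm at every inner iteration.
        res = []
        spla.gmres(A, f, restart=restart,
                   callback=res.append, callback_type="pr_norm")
        return res

    full, restarted = gmres_history(n), gmres_history(5)
    print(len(full), len(restarted))  # restarting typically needs more iterations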

How to avoid such spoiling of convergence?

Idea of biconjugate gradient

The idea behind the BiCG method starts from the normal equations

$$A^* A x = A^* f,$$

to which one could apply the CG method, since $A^* A$ is symmetric positive definite.

Let us run a demo on a simple non-symmetric matrix to demonstrate the instability of the BiCG method.

BiConjugate Gradients

There are two options:

  1. Use $\mathcal{K}(A^* A, A^* f)$ to generate the subspace. That leads to squaring of the condition number.
  2. Instead, use two Krylov subspaces $\mathcal{K}(A)$ and $\mathcal{K}(A^*)$ to generate two bases that are biorthogonal (so-called biorthogonal Lanczos).

The goal is to compute the Petrov-Galerkin projection

$$W^* A V \widehat{x} = W^* f$$

where the columns of $W$ come from the Krylov subspace of $A^*$ and the columns of $V$ from that of $A$ (cf. the CG case).

This may lead to instabilities if we try to update the solution efficiently: the short recurrences would require pivoting (which we did not need in CG), and pivoting is not naturally available here.

Notes about BiCG

A practical implementation of BiCG uses the two-sided Lanczos process, generating Krylov subspaces for both $A$ and $A^{\top}$.

In particular, the iteration reads as follows (a minimal code sketch is given after the list):

  1. $\alpha_j = \frac{(r_j, \hat{r}_j)}{(Ap_j, \hat{p}_j)}$
  2. $x_{j+1} = x_j + \alpha_j p_j $
  3. $r_{j+1} = r_j - \alpha_j Ap_j$
  4. $\hat{r}_{j+1} = \hat{r}_j - \alpha_j A^{\top}\hat{p}_j$
  5. $\beta_j = \frac{(r_{j+1}, \hat{r}_{j+1})}{(r_j, \hat{r}_j)}$
  6. $p_{j+1} = r_{j+1} + \beta_j p_j$
  7. $\hat{p}_{j+1} = \hat{r}_{j+1} + \beta_j \hat{p}_j$
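
A minimal sketch of these recurrences for a dense numpy array (no breakdown checks and no preconditioning; the function name is illustrative):

    import numpy as np

    def bicg(A, f, x0=None, maxiter=100, tol=1e-8):
        # Minimal BiCG following the recurrences above.
        n = A.shape[0]
        x = np.zeros(n) if x0 is None else x0.copy()
        r = f - A @ x
        r_hat = r.copy()               # shadow residual for the A^T subspace
        p, p_hat = r.copy(), r_hat.copy()
        for _ in range(maxiter):
            Ap = A @ p
            alpha = (r @ r_hat) / (Ap @ p_hat)
            x = x + alpha * p
            r_new = r - alpha * Ap
            r_hat_new = r_hat - alpha * (A.T @ p_hat)
            if np.linalg.norm(r_new) < tol:
                break
            beta = (r_new @ r_hat_new) / (r @ r_hat)
            p = r_new + beta * p
            p_hat = r_hat_new + beta * p_hat
            r, r_hat = r_new, r_hat_new
        return x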

Now we move on to a stabilized version of the BiCG method.

BiCGStab

A short demo to compare "Stabilized" vs "Non-stabilized" versions.
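
One possible version of such a demo, using scipy.sparse.linalg.bicg and bicgstab (the matrix is made up for illustration):

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # Track the residual history of BiCG vs BiCGStab on a simple
    # non-symmetric matrix.
    n = 200
    A = sp.diags([-1.0, 2.0, -0.5], [-1, 0, 1], shape=(n, n), format="csr")
    f = np.ones(n)

    def history(method):
        res = []
        method(A, f, maxiter=60,
               callback=lambda xk: res.append(np.linalg.norm(f - A @ xk)))
        return res

    res_bicg, res_bicgstab = history(spla.bicg), history(spla.bicgstab)
    print(res_bicg[-1], res_bicgstab[-1])

Plotting the two residual histories typically shows the erratic behaviour of BiCG versus the smoother decrease of BiCGStab.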

"Nonlinear GMRES" or Anderson acceleration

We can apply the GMRES-like idea to speed up the convergence of a given fixed-point iteration

$$x_{k+1} = \Phi(x_k).$$

This idea is actually older than GMRES and is known as Direct Inversion in the Iterative Subspace (DIIS) in quantum chemistry, or Anderson acceleration.

Idea: use history for the update

$$x_{k+1} = \Phi(x_k) + \sum_{s=1}^m \alpha_s (x_{k - s} - \Phi(x_{k - s})), $$

and the parameters $\alpha_s$ are selected to minimize the norm of the residual

$$ \min_{\alpha} \left \| \sum_{s=1}^m \alpha_s (x_{k - s} - \Phi(x_{k - s})) \right\|_2, \quad \sum_{s=1}^m \alpha_s = 1$$

More details can be found in the original paper.
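
A minimal sketch of one common variant of this scheme (the constrained least-squares formulation above); the function name, window size and the absence of any regularization are illustrative choices:

    import numpy as np

    def anderson(Phi, x0, m=5, maxiter=50, tol=1e-10):
        # Keep the last m iterates, pick weights alpha (summing to one) that
        # minimize the norm of the mixed residual, and combine the Phi values.
        xs, fs = [x0], [Phi(x0)]
        for _ in range(maxiter):
            G = np.array([f - x for x, f in zip(xs, fs)]).T  # residual columns
            mk = G.shape[1]
            # Solve min ||G alpha|| s.t. sum(alpha) = 1 via the KKT system.
            KKT = np.block([[G.T @ G, np.ones((mk, 1))],
                            [np.ones((1, mk)), np.zeros((1, 1))]])
            rhs = np.zeros(mk + 1)
            rhs[-1] = 1.0
            alpha = np.linalg.lstsq(KKT, rhs, rcond=None)[0][:mk]
            x_new = sum(a * f for a, f in zip(alpha, fs))
            f_new = Phi(x_new)
            if np.linalg.norm(f_new - x_new) < tol:
                return x_new
            xs.append(x_new)
            fs.append(f_new)
            if len(xs) > m:
                xs.pop(0)
                fs.pop(0)
        return xs[-1]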

Battling the condition number

Preconditioner: general concept

The general concept of the preconditioner is simple:

Given a linear system

$$A x = f,$$

we want to find the matrix $P_R$ and/or $P_L$ such that

  1. The condition number of $AP_R^{-1}$ (right preconditioner), or $P^{-1}_LA$ (left preconditioner), or $P^{-1}_L A P_R^{-1}$ is better than that of $A$
  2. We can easily solve $P_Ly = g$ or $P_Ry = g$ for any $g$ (otherwise we could simply choose e.g. $P_L = A$)

Then we solve for (right preconditioner)

$$ AP_R^{-1} y = f \quad \Rightarrow \quad P_R x = y$$

or (left preconditioner)

$$ P_L^{-1} A x = P_L^{-1}f,$$

or even both $$ P_L^{-1} A P_R^{-1} y = P_L^{-1}f \quad \Rightarrow \quad P_R x = y.$$

The best choice is of course $P = A,$ but this does not make life easier.
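
A minimal sketch of the right-preconditioning mechanics, with a made-up bidiagonal matrix and the simplest possible choice $P_R = \mathrm{diag}(A)$ (discussed below as the Jacobi preconditioner):

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # Non-symmetric test matrix with a strongly varying diagonal.
    n = 100
    main = np.arange(1, n + 1, dtype=float)
    A = sp.diags([main, 0.5 * np.ones(n - 1)], [0, 1], format="csr")
    f = np.ones(n)

    P_inv = sp.diags(1.0 / main)  # easy-to-apply inverse of P_R = diag(A)
    AP_inv = spla.LinearOperator((n, n), matvec=lambda y: A @ (P_inv @ y))

    y, info = spla.gmres(AP_inv, f)  # solve A P_R^{-1} y = f
    x = P_inv @ y                    # recover x from P_R x = y
    print(info, np.linalg.norm(A @ x - f))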

One of the ideas is to use other iterative methods (beside Krylov) as preconditioners.

Other iterative methods as preconditioners

There are other iterative methods that we have not mentioned.

  1. Jacobi method
  2. Gauss-Seidel
  3. SOR($\omega$) (Successive over-relaxation) and its symmetric modification SSOR($\omega$)

Jacobi method (as preconditioner)

Consider again a matrix with non-zero diagonal entries. To get the Jacobi method, express the diagonal unknown from each equation:

$$a_{ii} x_i = -\sum_{j \ne i} a_{ij} x_j + f_i$$

and use this to iteratively update $x_i$:

$$ x_i^{(k+1)} = -\frac{1}{a_{ii}}\left( \sum_{j \ne i} a_{ij} x_j^{(k)} - f_i \right),$$

or in the matrix form

$$ x^{(k+1)} = D^{-1}\left((D-A)x^{(k)} + f\right) $$

where $D = \mathrm{diag}(A)$ and finally

$$ x^{(k+1)} = x^{(k)} - D^{-1}(Ax^{(k)} - f). $$

So, the Jacobi method is nothing but the simple Richardson iteration with $\tau=1$ and left preconditioner $P = D$, the diagonal of the matrix. Therefore we will refer to $P = \mathrm{diag}(A)$ as the Jacobi preconditioner. Note that it can be used with any other method, e.g. Chebyshev or Krylov-type methods.
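
A minimal sketch of this iteration for a dense array (assuming a matrix for which the iteration converges, e.g. a diagonally dominant one):

    import numpy as np

    def jacobi(A, f, x0, maxiter=200):
        # Jacobi iteration as Richardson with P = diag(A): x <- x - D^{-1}(Ax - f).
        d = np.diag(A)
        x = x0.copy()
        for _ in range(maxiter):
            x = x - (A @ x - f) / d
        return x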

Properties of the Jacobi preconditioner

Jacobi preconditioner:

  1. Very easy to compute and apply
  2. Works well for diagonally dominant matrices (remember the Gershgorin circle theorem!)
  3. Useless if all diagonal entries are the same (proportional to the identity matrix)

Gauss-Seidel (as preconditioner)

Another well-known method is the Gauss-Seidel method.

Its canonical form is very similar to the Jacobi method, with one small difference: when we update $x_i$ as

$$x_i^{(k+1)} := -\frac{1}{a_{ii}}\left( \sum_{j =1}^{i-1} a_{ij} x_j^{(k+1)} +\sum_{j = i+1}^n a_{ij} x_j^{(k)} - f_i \right)$$

we immediately use the already updated components in the later updates, whereas in the Jacobi method we use only the full vector from the previous iteration.

Its matrix form is more complicated.

Gauss-Seidel: matrix version

Given $A = A^{*} > 0$ we have

$$A = L + D + L^{*},$$

where $D$ is the diagonal of $A$ and $L$ is the lower-triangular part of $A$ with zeros on the diagonal.

One iteration of the GS method reads

$$ x^{(k+1)} = x^{(k)} - (L + D)^{-1}(Ax^{(k)} - f). $$

and we refer to the preconditioner $P = L+D$ as the Gauss-Seidel preconditioner.

Good news: $\rho(I - (L+D)^{-1} A) < 1$, where $\rho$ is the spectral radius,

i.e. for a symmetric positive definite matrix the GS method always converges.
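
A minimal sketch of this iteration for a dense array, applying $(L+D)^{-1}$ via a triangular solve:

    import numpy as np
    from scipy.linalg import solve_triangular

    def gauss_seidel(A, f, x0, maxiter=100):
        # One GS sweep: x <- x - (L + D)^{-1} (A x - f),
        # where L + D is the lower triangle of A (diagonal included).
        x = x0.copy()
        LD = np.tril(A)
        for _ in range(maxiter):
            x = x - solve_triangular(LD, A @ x - f, lower=True)
        return x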

Gauss-Seidel and coordinate descent

GS-method can be viewed as a coordinate descent method, applied to the energy functional

$$F(x) = (Ax, x) - 2(f, x)$$

with the iteration

$$x_i := \arg \min_z F(x_1, \ldots, x_{i-1}, z, x_{i+1}, \ldots, x_d).$$

Moreover, the order in which we eliminate the variables is really important!
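
To see the equivalence for symmetric $A$, set the derivative of $F$ with respect to the $i$-th coordinate (all other coordinates fixed) to zero:

$$\frac{\partial F}{\partial x_i} = 2\left(\sum_{j} a_{ij} x_j - f_i\right) = 0 \quad \Longrightarrow \quad x_i = \frac{1}{a_{ii}}\Big(f_i - \sum_{j \ne i} a_{ij} x_j\Big),$$

which is exactly the Gauss-Seidel update of the $i$-th component.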

Side note: Nonlinear Gauss-Seidel (a.k.a coordinate descent)

If $F$ is given, and we optimize one coordinate at a time, we have

$$x_i := \arg \min_z F(x_1, \ldots, x_{i-1}, z, x_{i+1}, \ldots, x_d).$$

Note the convergence result for block coordinate descent in the case of a general functional $F$:

it converges locally with the speed of the GS method applied to the Hessian $H = \nabla^2 F$ of the functional.

Thus, if $F$ is twice differentiable and $x_*$ is a local minimum, then $H > 0$ and Gauss-Seidel (coordinate descent) converges locally.

Successive overrelaxation (as preconditioner)

We can even introduce a parameter $\omega$ into the GS-method preconditioner, giving a successive over-relaxation (SOR($\omega$)) method:

$$ x^{(k+1)} = x^{(k)} - \omega (D + \omega L)^{-1}(Ax^{(k)} - f), $$

which corresponds to the preconditioner

$$P = \frac{1}{\omega}(D+\omega L).$$
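
A minimal sketch of the SOR($\omega$) sweep for a dense array (the value of $\omega$ below is an arbitrary illustration, not an optimal choice):

    import numpy as np
    from scipy.linalg import solve_triangular

    def sor(A, f, x0, omega=1.5, maxiter=100):
        # SOR(omega) sweep: x <- x - omega (D + omega L)^{-1} (A x - f).
        x = x0.copy()
        D = np.diag(np.diag(A))
        L = np.tril(A, k=-1)
        DL = D + omega * L
        for _ in range(maxiter):
            x = x - omega * solve_triangular(DL, A @ x - f, lower=True)
        return x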

Preconditioners for sparse matrices

Recall Gaussian elimination (LU decomposition with pivoting):

$$A = P_1 L U P^{\top}_2, $$

where $P_1$ and $P_2$ are certain permutation matrices (which do the pivoting).

Incomplete LU

Suppose we want to eliminate the variable $x_1$, and the only equations containing it are

$$5 x_1 + x_4 + x_{10} = 1, \quad 3 x_1 + x_4 + x_8 = 0, \ldots,$$

i.e. in all other equations $x_1$ is not present. Expressing $x_1 = (1 - x_4 - x_{10})/5$ from the first equation and substituting it into the second gives

$$x_4 + x_8 + 3(1 - x_4 - x_{10})/5 = 0,$$

so $x_{10}$ now enters an equation where it was absent before: a new non-zero (fill-in) has appeared.

Incomplete-LU: formal definition

We run the usual LU-decomposition cycle, but avoid inserting non-zeros other than the initial non-zero pattern.

    import numpy as np

    def dense_lu(a):
        # Usual LU elimination cycle (no pivoting); ILU(0) would skip the
        # update of a[i, j] whenever the original entry a[i, j] was zero.
        a = a.astype(float)
        n = a.shape[0]
        L = np.zeros((n, n))
        U = np.zeros((n, n))
        for k in range(n):  # eliminate one row
            L[k, k] = 1
            for i in range(k + 1, n):
                L[i, k] = a[i, k] / a[k, k]
                for j in range(k + 1, n):
                    a[i, j] -= L[i, k] * a[k, j]  # new fill-ins appear here
            for j in range(k, n):
                U[k, j] = a[k, j]
        return L, U

ILU(k)

ILU(k): idea

LU & graphs

ILU Thresholded (ILUT)

A much more popular approach is based on the so-called thresholded LU.

You do the standard Gaussian elimination with fill-ins, but drop the fill-in elements whose magnitude is smaller than a prescribed threshold $\tau$ (and, in practice, also limit the number of non-zeros kept per row).

It is denoted ILUT($\tau$).
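
A possible usage sketch: scipy's spilu computes a thresholded incomplete LU (its drop_tol parameter plays the role of $\tau$), which can be wrapped as a preconditioner for GMRES; the 2D Laplacian below is just an illustrative test matrix:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # 2D Laplacian on a 30 x 30 grid.
    m = 30
    lap1d = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
    A = (sp.kron(lap1d, sp.eye(m)) + sp.kron(sp.eye(m), lap1d)).tocsc()
    n = A.shape[0]
    f = np.ones(n)

    ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)
    M = spla.LinearOperator((n, n), matvec=ilu.solve)  # acts as P^{-1}
    x, info = spla.gmres(A, f, M=M)
    print(info, np.linalg.norm(A @ x - f))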

Symmetric positive definite case

Second-order LU preconditioners

$$A \approx U_2 U^{\top}_2 + U_2 R^{\top}_2 + R_2 U^{\top}_2,$$

which is just the expansion of the $UU^{\top}$ factorization with respect to a perturbation of $U$ (the second-order term $R_2 R^{\top}_2$ is dropped).

Summary of this part

Next lecture

Questions?