Lecture 11. Direct solvers for sparse matrices (cont'd). Intro to iterative methods

Previous lecture: what was already discussed

Today we will sequentially discuss approaches to minimize fill-in

Fill-in upper bound minimization: Markowitz pivoting

What if we consider only the neighbours? We get the minimal degree ordering!

But these methods ignore our knowledge of which structures are good for sparse LU!

Let's exploit it explicitly in the method!

How to formalize reduction to block arrowhead form?

Definition. A separator in a graph $G$ is a set $S$ of vertices whose removal leaves at least two connected components.

A separator $S$ gives the following ordering for an $N$-vertex graph $G$: number the vertices of the components obtained after removing $S$ first (component by component), and number the vertices of $S$ last.

Separator and block arrowhead structure: example

Separator for the 2D Laplacian matrix

$$ A_{2D} = I \otimes A_{1D} + A_{1D} \otimes I, \quad A_{1D} = \mathrm{tridiag}(-1, 2, -1), $$

is as follows

Once we enumerate the indices in $\alpha$ first, then those in $\beta$, and finally the separator indices in $\sigma$, we get the following matrix

$$ PAP^\top = \begin{bmatrix} A_{\alpha\alpha} & & A_{\alpha\sigma} \\ & A_{\beta\beta} & A_{\beta\sigma} \\ A_{\sigma\alpha} & A_{\sigma\beta} & A_{\sigma\sigma}\end{bmatrix} $$

which has block arrowhead structure.
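As an illustration, here is a minimal sketch (assumed code, not the lecture's demo; the grid size and the choice of the middle grid column as separator are assumptions) that builds the 2D Laplacian and checks the block arrowhead structure after the permutation:

```python
import numpy as np
import scipy.sparse as sp

n = 8                                                      # grid size (illustrative)
A1d = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))      # tridiag(-1, 2, -1)
A2d = sp.kron(sp.eye(n), A1d) + sp.kron(A1d, sp.eye(n))    # A_2D = I x A_1D + A_1D x I

idx = np.arange(n * n).reshape(n, n)       # node (i, j) of the grid has index i*n + j
sep = n // 2
alpha = idx[:, :sep].ravel()               # nodes to the left of the separator
beta = idx[:, sep + 1:].ravel()            # nodes to the right of the separator
sigma = idx[:, sep].ravel()                # separator: the middle grid column, numbered last
perm = np.concatenate([alpha, beta, sigma])

PAPt = A2d.tocsr()[perm, :][:, perm]       # P A P^T
# the off-diagonal (alpha, beta) block is empty -> block arrowhead structure
print(PAPt[:alpha.size, alpha.size:alpha.size + beta.size].nnz)   # prints 0
```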

The method of recursive reduction to block arrowhead structure – Nested dissection

Computing the Cholesky factorization of the (generally dense) separator block of size $n$ costs $\mathcal{O}(n^3) = \mathcal{O}(N^{3/2})$, where $N = n^2$ is the total number of nodes.

So, the complexity is $\mathcal{O}(N^{3/2})$
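A sketch of the standard counting argument behind this estimate (it is not spelled out above): each level of the recursion factors a dense separator block, so the total cost $C(N)$ satisfies

$$ C(N) = 2\, C(N/2) + \mathcal{O}(N^{3/2}), $$

and summing the geometric series over the levels gives

$$ C(N) = \mathcal{O}\Big(N^{3/2} \sum_{k \geq 0} 2^{-k/2}\Big) = \mathcal{O}(N^{3/2}). $$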

Packages for nested dissection

All of them have interfaces for C/C++, Fortran and Matlab

Nested dissection summary

Separators in practice

Existing approaches:

One of the ways to construct separators – spectral graph partitioning

Assign each vertex $i$ a label $x_i \in \{+1, -1\}$ indicating which of the two parts it belongs to. The cost of a partition (proportional to the number of cut edges) is

$$E_c(x) = \sum_{j} \sum_{i \in N(j)} (x_i - x_j)^2, \quad N(j) \text{ denotes the set of neighbours of a node } j. $$

Requiring the two parts to have equal size gives

$$\sum_i x_i = 0 \quad \Longleftrightarrow \quad x^\top e = 0, \quad e = \begin{bmatrix}1 & \dots & 1\end{bmatrix}^\top,$$

and since we have +1/-1 labels, we have

$$\sum_i x^2_i = n \quad \Longleftrightarrow \quad \|x\|_2^2 = n.$$

Graph Laplacian

The cost $E_c$ can be written as (check why)

$$E_c = (Lx, x)$$

where $L$ is the graph Laplacian, a symmetric matrix with entries

$$L_{ij} = \begin{cases} \text{degree of node } i, & i = j, \\ -1, & i \ne j \text{ and there is an edge } (i, j), \\ 0, & \text{otherwise}. \end{cases}$$

Partitioning as an optimization problem

Relaxing the discrete labels $x_i \in \{\pm 1\}$ to real values (the discrete problem is hard to solve exactly) leads to the continuous optimization problem

$$E_c(x) = (Lx, x)\to \min_{\substack{x^\top e =0, \\ \|x\|_2^2 = n}}$$

From Fiedler vector to separator

$$ \min_{\substack{x^\top e =0, \\ \|x\|_2^2 = n}} (Lx, x) = n \cdot \min_{{x^\top e =0}} \frac{(Lx, x)}{(x, x)} = n \cdot \min_{{x^\top e =0}} R(x), \quad R(x) \text{ is the Rayleigh quotient} $$

The minimizer is the eigenvector of $L$ corresponding to its second-smallest eigenvalue (the Fiedler vector), since $e$ is the eigenvector for the smallest eigenvalue $0$. The sign pattern of the Fiedler vector splits the vertices into two parts, and the vertices incident to the cut edges give a separator.
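A hedged sketch of what such a bisection demo could look like (assumed illustrative code, not the notebook's original; the grid graph and its size are chosen only for illustration):

```python
import numpy as np
import scipy.sparse as sp

n = 16
P = sp.diags([1.0, 1.0], [-1, 1], shape=(n, n))               # path-graph adjacency
G = sp.kron(sp.eye(n), P) + sp.kron(P, sp.eye(n))             # grid-graph adjacency
L = sp.diags(np.asarray(G.sum(axis=1)).ravel()) - G           # graph Laplacian

lam, V = np.linalg.eigh(L.toarray())     # dense eigensolve is fine for this small demo
fiedler = V[:, 1]                        # eigenvector of the second-smallest eigenvalue

part = np.where(fiedler > 0)[0]          # split vertices by the sign of the Fiedler entries
print(lam[1], len(part))                 # algebraic connectivity and the size of one part
```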

Summary on demo

Fiedler vector and algebraic connectivity of a graph

Definition. The algebraic connectivity of a graph is the second-smallest eigenvalue of the Laplacian matrix.

Claim. The algebraic connectivity of a graph is greater than $0$ if and only if the graph is connected.

Practical problems

Computing bisection recursively is expensive.

As an alternative, one typically computes a multilevel bisection, which consists of 3 phases: coarsening of the graph, partitioning of the coarsest graph, and uncoarsening with refinement of the partition.

Practical problems (2)

Details in this survey

Take home message about direct methods for sparse matrices

The main topics for today

Concept of iterative methods for linear systems:

Iterative methods

Matrix as a black box

Richardson iteration

The simplest idea is the "simple iteration method" or Richardson iteration.

$$Ax = f \quad \Longleftrightarrow \quad \tau (Ax - f) = 0 \quad \Longleftrightarrow \quad x = x - \tau (Ax - f),$$

which suggests the fixed-point iteration

$$x_{k+1} = x_k - \tau (Ax_k - f),$$

where $\tau$ is the iteration parameter; for a suitable class of matrices (e.g. $A = A^* > 0$) it can always be chosen so that the method converges.

Connection to ODEs

The solution of $Ax = f$ is the steady state of the ODE

$$\frac{dy}{dt} + A y = f, \quad y(0) = y_0.$$

Discretizing in time with the explicit Euler scheme and step $\tau$ gives

$$\frac{y_{k+1} - y_k}{\tau} = -A y_k + f,$$

which leads to the Richardson iteration

$$ y_{k+1} = y_k - \tau(Ay_k -f) $$

Convergence of the Richardson method

For the error $e_k = x_k - x_*$, where $x_* = A^{-1} f$, we have

$$ e_{k+1} = (I - \tau A) e_k, $$

therefore if $\Vert I - \tau A \Vert < 1$ in some operator norm, the iteration converges.

Optimal parameter choice

For $A = A^* > 0$, minimizing $\Vert I - \tau A \Vert_2 = \max_i |1 - \tau \lambda_i|$ over $\tau$ gives

$$ \tau_\mathrm{opt} = \frac{2}{\lambda_{\min} + \lambda_{\max}}, $$

where $\lambda_{\min}$ is the minimal eigenvalue and $\lambda_{\max}$ is the maximal eigenvalue of the matrix $A$.

Condition number and convergence speed

Even with the optimal parameter choice, the error at the next step satisfies

$$\|e_{k+1}\|_2 \leq q \|e_k\|_2 , \quad\rightarrow \quad \|e_k\|_2 \leq q^{k} \|e_0\|_2,$$

where

$$ q = \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}} = \frac{\mathrm{cond}(A) - 1}{\mathrm{cond}(A)+1}, $$

and

$$\mathrm{cond}(A) = \frac{\lambda_{\max}}{\lambda_{\min}} \quad \text{for} \quad A=A^*>0$$

is the condition number of $A$.
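To make the estimate concrete, here is a minimal sketch of the Richardson iteration with the optimal parameter (the 1D Laplacian test problem and the iteration count are assumptions for illustration):

```python
import numpy as np

n = 64
A = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
f = np.ones(n)

lam = np.linalg.eigvalsh(A)                    # eigenvalues in ascending order
tau = 2.0 / (lam[0] + lam[-1])                 # tau_opt = 2 / (lambda_min + lambda_max)
q = (lam[-1] - lam[0]) / (lam[-1] + lam[0])    # predicted convergence factor

x = np.zeros(n)
for k in range(500):
    x = x - tau * (A @ x - f)                  # x_{k+1} = x_k - tau (A x_k - f)

print(np.linalg.norm(A @ x - f), q)            # q is close to 1 since cond(A) grows like n^2
```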

Let us do some demo...

Consider a non-Hermitian matrix $A$

Possible cases of Richardson iteration behaviour:

Q: how can we identify which case we are in before running the iterative method?

Better iterative methods

But before turning to preconditioners, we can use better iterative methods.

There is a whole zoo of iterative methods, but we need to know just a few of them.

Attempt 1: The steepest descent method

$$ x_{k+1} = x_k - \tau_k (A x_k - f). $$

For $A = A^* > 0$ this is gradient descent for the energy functional $F(x) = \frac12 (Ax, x) - (f, x)$, whose gradient is the residual $r_k = Ax_k - f$, with the exact line search

$$ \tau_k = \arg\min_{\tau} \|x_k - \tau (A x_k - f) - x_*\|^2_A, \quad x_* = A^{-1} f, $$

which gives

$$ \tau_k = \frac{r_k^{\top}r_k}{r_k^{\top}Ar_k}, \quad r_k = Ax_k - f. $$
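A minimal sketch of the steepest descent iteration (the same 1D Laplacian test problem is an assumption for illustration):

```python
import numpy as np

n = 64
A = np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
f = np.ones(n)

x = np.zeros(n)
for k in range(500):
    r = A @ x - f                        # residual r_k = A x_k - f
    if np.linalg.norm(r) < 1e-10:
        break
    tau = (r @ r) / (r @ (A @ r))        # tau_k = (r_k, r_k) / (A r_k, r_k)
    x = x - tau * r                      # x_{k+1} = x_k - tau_k r_k

print(np.linalg.norm(A @ x - f))
```

The adaptive step helps, but the asymptotic rate is still governed by $\mathrm{cond}(A)$.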

Attempt 2: Chebyshev iteration

Another way to find $\tau_k$ is to consider

$$e_{k+1} = (I - \tau_k A) e_k = (I - \tau_k A) (I - \tau_{k-1} A) e_{k-1} = \ldots = p(A) e_0, $$

where $p(A)$ is a matrix polynomial (simplest matrix function)

$$ p(A) = (I - \tau_k A) \ldots (I - \tau_0 A), $$

and $p(0) = 1$.

Optimal choice of time steps

The error is written as

$$e_{k+1} = p(A) e_0, $$

and hence

$$\|e_{k+1}\| \leq \|p(A)\| \|e_0\|, $$

where $p(0) = 1$ and $p(A)$ is a matrix polynomial.

To get better error reduction, we need to minimize

$$\Vert p(A) \Vert$$

over all possible polynomials $p(x)$ of degree $k+1$ such that $p(0)=1$. We will use $\|\cdot\|_2$.

Polynomials least deviating from zero

Important special case: $A = A^* > 0$.

Then $A = U \Lambda U^*$,

and

$$\Vert p(A) \Vert_2 = \Vert U p(\Lambda) U^* \Vert_2 = \Vert p(\Lambda) \Vert_2 = \max_i |p(\lambda_i)| \overset{!}{\leq} \max_{\lambda_\min \leq \lambda {\leq} \lambda_\max} |p(\lambda)|.$$

The latter inequality is the only approximation. Here we make the crucial assumption that we do not try to exploit the actual distribution of the spectrum between $\lambda_\min$ and $\lambda_\max$.

Thus, we need to find a polynomial with $p(0) = 1$ that deviates least from $0$ on $[\lambda_\min, \lambda_\max]$.

Polynomials least deviating from zero (2)

We can map the interval $[-1, 1]$ onto $[\lambda_\min, \lambda_\max]$ by the affine change of variables

$$ \xi = \frac{\lambda_\max + \lambda_\min - (\lambda_\min-\lambda_\max)x}{2}, \quad x\in [-1, 1]. $$

The problem is then reduced to the problem of finding the polynomial least deviating from zero on an interval $[-1, 1]$.

Exact solution: Chebyshev polynomials

The exact solution to this problem is given (up to normalization) by the famous Chebyshev polynomials of the form

$$T_n(x) = \cos (n \arccos x)$$

What do you need to know about Chebyshev polynomials

  1. This is a polynomial!

  2. We can express $T_n$ from $T_{n-1}$ and $T_{n-2}$:

$$T_n(x) = 2x T_{n-1}(x) - T_{n-2}(x), \quad T_0(x)=1, \quad T_1(x)=x$$

  3. $|T_n(x)| \leq 1$ for $x \in [-1, 1]$.

  4. It has $(n+1)$ alternation points on $[-1, 1]$, where the maximal absolute value is achieved (this is the necessary and sufficient condition for optimality; the Chebyshev alternation theorem, no proof here).

  5. The roots are given by

$$n \arccos x_k = \frac{\pi}{2} + \pi k, \quad \rightarrow\quad x_k = \cos \frac{\pi(2k + 1)}{2n}, \; k = 0, \ldots,n-1$$

We can plot them...
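A sketch of such a plot (illustrative code, assumed rather than the notebook's original): Chebyshev polynomials on $[-1, 1]$ together with the roots of $T_5$ from the formula above.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-1, 1, 500)
for deg in range(1, 6):
    plt.plot(x, np.cos(deg * np.arccos(x)), label=rf"$T_{deg}(x)$")   # T_n(x) = cos(n arccos x)

n = 5
roots = np.cos(np.pi * (2 * np.arange(n) + 1) / (2 * n))    # x_k = cos(pi (2k + 1) / (2n))
plt.plot(roots, np.zeros_like(roots), "ko", label="roots of $T_5$")
plt.legend()
plt.show()
```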

Convergence of the Chebyshev-accelerated Richardson iteration

Note that $p(x) = (1-\tau_n x)\dots (1-\tau_0 x)$, hence the roots of $p(x)$ are $1/\tau_i$, and we additionally need to map back from $[-1,1]$ to $[\lambda_\min, \lambda_\max]$. This results in

$$\tau_i = \frac{2}{\lambda_\max + \lambda_\min - (\lambda_\max - \lambda_\min)x_i}, \quad x_i = \cos \frac{\pi(2i + 1)}{2n}\quad i=0,\dots,n-1$$

The convergence estimate (we only give the result without proof) is now

$$ \|e_{k+1}\| \leq C q^k \|e_0\|, \quad q = \frac{\sqrt{\mathrm{cond}(A)}-1}{\sqrt{\mathrm{cond}(A)}+1}, $$

which is better than in the Richardson iteration.
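A hedged numerical sketch of the Chebyshev-accelerated Richardson iteration (the synthetic SPD test matrix with spectrum in $[1, 100]$ and the number of steps are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 16
lam = np.linspace(1.0, 100.0, n)                   # prescribed spectrum
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(lam) @ Q.T                         # SPD matrix with known spectrum
f = rng.standard_normal(n)
x_true = np.linalg.solve(A, f)

lmin, lmax = lam[0], lam[-1]
roots = np.cos(np.pi * (2 * np.arange(k) + 1) / (2 * k))   # Chebyshev roots on [-1, 1]
taus = 2.0 / (lmax + lmin - (lmax - lmin) * roots)         # tau_i as above

x = np.zeros(n)
for tau in taus:
    x = x - tau * (A @ x - f)              # one Richardson step with parameter tau_i

q_rich = (lmax - lmin) / (lmax + lmin)
q_cheb = (np.sqrt(lmax / lmin) - 1.0) / (np.sqrt(lmax / lmin) + 1.0)
rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
print(rel_err, q_cheb**k, q_rich**k)       # Chebyshev rate vs. plain Richardson rate
```

Note that in practice the $\tau_i$ are applied in a special order (or the iteration is rewritten via the three-term Chebyshev recurrence), since the naive ordering can be numerically unstable for large $k$.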

What happened to the great Chebyshev iterations?

Chebfun project

Beyond Chebyshev

Spectrum of the matrix contained in multiple segments

How can we make it better

$$r_k = A x_k - f.$$

Crucial point: Krylov subspace

The Chebyshev method produces the approximation of the form

$$x_{k+1} = x_0 + p(A) r_0,$$

for some polynomial $p$, i.e. the correction $x_{k+1} - x_0$ lies in a Krylov subspace of the matrix, which is defined as

$$ \mathcal{K}_k(A, r_0) = \mathrm{Span}(r_0, Ar_0, A^2 r_0, \ldots, A^{k-1}r_0 ) $$

The most natural approach then is to find the vector in this linear subspace that minimizes a certain norm of the error

Idea of Krylov methods

The idea is to minimize given functional:

To make methods practical one has to

  1. Orthogonalize the vectors $A^i r_0$ spanning the Krylov subspace for stability (the Lanczos process for symmetric $A$, the Arnoldi process in general; see the sketch after this list).
  2. Derive recurrent formulas to decrease complexity.
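As referenced in the first item above, here is a minimal sketch of such an orthogonalization via the Arnoldi process (the test matrix and the helper name `arnoldi` are illustrative assumptions):

```python
import numpy as np

def arnoldi(A, r0, k):
    """Build an orthonormal basis Q of the Krylov subspace and the Hessenberg matrix H."""
    n = r0.shape[0]
    Q = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    Q[:, 0] = r0 / np.linalg.norm(r0)
    for j in range(k):
        w = A @ Q[:, j]
        for i in range(j + 1):            # orthogonalize against the previous basis vectors
            H[i, j] = Q[:, i] @ w
            w = w - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        Q[:, j + 1] = w / H[j + 1, j]     # (breakdown H[j+1, j] = 0 is ignored in this sketch)
    return Q, H

A = np.diag(np.arange(1.0, 11.0))
r0 = np.ones(10)
Q, H = arnoldi(A, r0, 5)
print(np.allclose(Q.T @ Q, np.eye(6)),          # orthonormal basis
      np.allclose(A @ Q[:, :5], Q @ H))         # Arnoldi relation A Q_k = Q_{k+1} H
```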

We will consider these methods in detail in the next lecture.

Take home message

Questions?