Lecture 16: Iterative methods for large scale eigenvalue problems

Previous lecture

Partial eigenvalue problem

Power method

Recall that the simplest method to find the largest eigenvalue is the power method

$$ x_{i+1} = \frac{Ax_{i}}{\|Ax_{i}\|} $$

The convergence is linear with rate $q = \left|\frac{\lambda_1}{\lambda_2}\right|$.

Inverse iteration

To find the smallest eigenvalue one may run power method for $A^{-1}$:

$$x_{i+1} = \frac{A^{-1}x_{i}}{\|A^{-1}x_{i}\|}.$$

To accelerate convergence shift-and-invert strategy can be used:

$$x_{i+1} = \frac{(A-\sigma I)^{-1}x_{i}}{\|(A-\sigma I)^{-1}x_{i}\|},$$

where $\sigma$ should be close to the eigenvalue we want to find.

Rayleigh quotient (RQ) iteration

In order to get superlinear convergence one may use adaptive shifts:

$$x_{i+1} = \frac{(A-R(x_i) I)^{-1}x_{i}}{\|(A-R(x_i) I)^{-1}x_{i}\|},$$

where $R(x_k) = \frac{(x_i, Ax_i)}{(x_i, x_i)}$ is Rayleigh quotient.

The method converges cubically for Hermitian matrices and quadratically for non-Hermitian case.

Inexact inverse iteration framework

Block power method

The block power method (also known as subspace iteration method or simultaneous vector iteration) is a natural generalization of the power method for several largest eigenvalues.
It looks as:

  1. $Y_0$ is $N\times k$ matrix of rank $k$, $Y_0 = X_0 R_0$ (QR-decomposition)
  2. $Y_i = AX_{i-1}$
  3. $Y_i = X_i R_i$ (QR-decomposition)

QR-decomposition plays role of normalization in the standard power method.

Moreover, orthogonalization prevents the columns of the $X_i$ from converging all to the eigenvector corresponding to the largest modulus eigenvalue.

Accelerating convergence of the block power method

Before we go to advanced methods let us discuss the important concept of Ritz approximation.

Ritz approximation

Given subspace spanned by columns of unitary matrix $Q_k$ of size $N\times k$ we consider the projected matrix $Q_k^* A Q_k$.

Let $\Theta_k=\mathrm{diag}(\theta_1,\dots,\theta_k)$ and $S_k=\begin{bmatrix}s_1 & \dots & s_k \end{bmatrix}$ be matrices of eigenvalues and eigenvectors of $(Q_k^* A Q_k)$:

$$(Q_k^* A Q_k)S_k = S_k \Theta_k$$

then $\{\theta_i\}$ are called Ritz values and $y_i = Q_k s_i$ - Ritz vectors.

Properties of the Ritz approximation

Rayleigh-Ritz method

Thus, if a subspace $V$ approximates first $k$ eigenvectors, then one can use the Rayleigh-Ritz method:

  1. Find orthonormal basis $Q_k$ in $V$ (e.g. by using QR decomposition)
  2. Calculate $Q_k^*AQ_k$
  3. Compute Ritz values and vectors
  4. Note that alternatevly one could use $V$ with no orthogonalization, but then generalized eigenvalue problem $(V^*AV)s_i = \theta_i (V^*V)s_i$ has to be solved.

The question is how to find a good subspace $V$.

Lanczos and Arnoldi methods

The good choice for $V$ is the Krylov subspace.

Recall that in the power method we used only one Krylov vector

$$x_k = \frac{A^k x_0}{\|A^k x_0\|}.$$

In this case $\theta_k = \frac{x_k^* A x_k}{x_k^* x_k}$ is nothing but a Ritz value. Natural idea is to use a bigger Krylov subspace.

As a result we can find more eigenvalues (power method only gives $\lambda_\max$). Moreover,convergence of the eigenvalue corresponding to $\lambda_\max$ will be faster.

For Hermitian matrices from the Arnoldi relation we have

$$ Q_k^*AQ_k = T_k, $$

where $Q_k$ is orthogonal basis in the Krylov subspace generated by the Lanczos procedure and $T_k$ is triangular matrix.

According to the Rayleigh-Ritz method we expect that eigenvalues of $T_k$ approximate eigenvalues of $A$. This method is called the Lanczos method. For nonsymmetric matrices it is called the Arnoldi method and instead of tridiagonal $T_k$ we would get upper=Hessenberg matrix.

Let us show that $\theta_\max \approx\lambda_\max$.

Why is $\theta_\max \approx \lambda_\max$?

Let us denote $\theta_1 \equiv \theta_\max$ and $\lambda_1 \equiv \lambda_\max$. Then

$$ \theta_1 = \max_{y\in \mathcal{K}_i, y\not=0}\frac{(y,Ay)}{(y,y)} = \max_{p_{i-1}} \frac{(p_{i-1}(A)x_0, A p_{i-1}(A)x_0)}{(p_{i-1}(A)x_0, p_{i-1}(A)x_0)}, $$

where $p_{i-1}$ is a polynomial of degree not greater than $i-1$ such that $p_{i-1}(A)x_0\not=0$.

Expand $x_0 = \sum_{j=1}^N c_j v_j$, where $v_j$ are eigenvectors of $A$ (form orthonormal basis).

Since $\theta_1 \leq \lambda_1$ we get $$ \lambda_1 - \theta_1 \leq \lambda_1 - \frac{(p_{i-1}(A)x_0, A p_{i-1}(A)x_0)}{(p_{i-1}(A)x_0, p_{i-1}(A)x_0)} $$ for any polynomial $p_{i-1}$. Hence $$ \lambda_1 - \theta_1 \leq \lambda_1 - \frac{\sum_{k=1}^N \lambda_k |p_{i-1}(\lambda_k)|^2 |c_k|^2}{\sum_{k=1}^N |p_{i-1}(\lambda_k)|^2 |c_k|^2} = $$ $$ = \frac{\sum_{k=2}^N (\lambda_1 - \lambda_k) |p_{i-1}(\lambda_k)|^2 |c_k|^2}{|p_{i-1}(\lambda_1)|^2 |c_1|^2 + \sum_{k=2}^N |p_{i-1}(\lambda_k)|^2 |c_k|^2} \leq (\lambda_1 - \lambda_n) \frac{\max_{2\leq k \leq N}|p_{i-1}(\lambda_k)|^2}{|p_{i-1}(\lambda_1)|^2 }\gamma, \quad \gamma = \frac{\sum_{k=2}^N|c_k|^2}{|c_1|^2} $$

Since the inequality holds for any polynomial $p_{i-1}$ we will choose a polynomial:

$$|p_{i-1}(\lambda_1)| \gg \max_{2\leq k \leq N}|p_{i-1}(\lambda_k)|.$$

This holds, e.g. for the Chebyshev polynomial on $[\lambda_n,\lambda_2]$. Thus, $\theta_1 \approx \lambda_1$ or more precisely (Paige-Kaniel error bound, check it!): $$ \lambda_1 - \theta_1 \leq \frac{\lambda_1 - \lambda_n}{T_{i-1}^2(1 + 2\mu)}\gamma, \quad \mu = \frac{\lambda_1 - \lambda_2}{\lambda_2 - \lambda_n}, $$ where $T_{i-1}$ is a Chebyshev polynomial.

Demo: approximating largest eigenvalue with Lanczos

Practical issues and stability

More problems with the Lanczos method

An alternative to this approach are the so-called preconditioned iterative methods that include:

  1. PINVIT (Preconditioned Inverse Iteration)
  2. LOBCPG (Locally optimal block preconditioned CG)
  3. Jacobi-Davidson method

PINVIT (preconditioned inverse iteration)

Derivation

Consider Rayleigh quotient $R(x) = \frac{(x,Ax)}{(x,x)}$. Then $$ \nabla R(x) = \frac{2}{(x,x)} (Ax - R(x) x), $$

so the simplest gradient descent method with a preconditioner $B$ reads

$$ x_{i+1} = x_{i} - \tau_i B^{-1} (Ax_i - R(x_i) x_i), $$$$ x_{i+1} = \frac{x_{i+1}}{\|x_{i+1}\|}. $$

Typically $B\approx (A-\sigma I)$, where $\sigma$ is called shift.

The closer $\sigma$ to the required eigenvalue is, the faster the convergence.

Convergence theory

Theorem (Knyazev and Neymeyr)

Let

then

$$ \left|\frac{R(x_{i+1}) - \lambda_j}{R(x_{i+1}) - \lambda_{j+1}}\right| < \left[ 1 - (1-\gamma)\left(1 - \frac{\lambda_j}{\lambda_{j+1}}\right) \right]^2 \cdot \left|\frac{R(x_{i}) - \lambda_j}{R(x_{i}) - \lambda_{j+1}}\right| $$

Block case

To find, e.g. $k$ eigenvalues one can do a one step of PINVIT for each vector:

$$ x^{(j)}_{i+1} = x^{(j)}_{i} - \tau^{(j)}_i B^{-1} (Ax^{(j)}_i - R(x^{(j)}_i) x^{(j)}_i), \quad j=1,\dots,k $$$$ x^{(j)}_{i+1} = \frac{x^{(j)}_{i+1}}{\|x^{(j)}_{i+1}\|}. $$

And then orthogonalize them using the QR-decomposition. However, it is better to use the Rayleigh-Ritz procedure:

LOBPCG (Locally Optimal Block Preconditioned CG)

Locally optimal PCG (not "Block" so far :))

LOPCG method

$$ x_{i+1} = x_{i} - \alpha_i B^{-1} (Ax_i - R(x_i) x_i) + \beta_i x_{i-1} , $$$$ x_{i+1} = \frac{x_{i+1}}{\|x_{i+1}\|}. $$

is a superior to PINVIT method as it adds to basis not only $x_i$ and $r_i$, but also $x_{i-1}$.

However, this interpretation leads to an unstable algorithm as $x_{i}$ is becoming colinear to $x_{i-1}$ as the procedure converges.

LOPCG (stable version)

Knyazev suggested an equivalent stable version, which introduces new vectors $p_i$ (conjugate gradients)

$$ p_{i+1} = r_{i} + \beta_i p_{i}, $$$$ x_{i+1} = x_{i} + \alpha_i p_{i+1}. $$

One can check that $\mathcal{L}(x_{i},x_{i-1},r_{i})=\mathcal{L}(x_{i},p_{i},r_{i})$.

The stable version explains name of the method:

In standard CG method we would minimze Rayleigh quotient $R$ in the conjugate gradient direction $p_{i+1}$:

$$\alpha_i = \arg\min_{\alpha_i} R(x_i + \alpha_i p_{i+1}).$$

In the locally-optimal CG we minimize over two parameters:

$$\alpha_i, \beta_i = \arg\min_{\alpha_i,\beta_i} R\left(x_i + \alpha_i p_{i+1}\right) = \arg\min_{\alpha_i,\beta_i} R\left(x_i + \alpha_i (r_{i} + \beta_i p_{i})\right)$$

and we locally obtain more optimal solution. That is why the method is called locally optimal.

As for PINVIT coefficients $\alpha_i,\beta_i$ can be found by the Rayleigh-Ritz procedure.

Locally optimal block PCG

In the block version similarly to PINVIT on each iteration we are given basis $V=[X^{(i)}_k,B^{-1}R^{(i)}_k, P^{(i)}_k]$ and use Rayleigh-Ritz procedure.

The overall algorithm:

  1. Find $\tilde A = V^* A V$
  2. Find $\tilde M = V^*V$
  3. Solve generalized eigenvalue problem $\tilde A S_k = \tilde M S_k \Theta_k$
  4. $P^{(i+1)}_{k} = [B^{-1}R^{(i)}_k, P^{(i)}_k]S_k[:,k:]$
  5. $X^{(i+1)}_{k} = X^{(i)}_k S_k[:,:k] + P^{(i+1)}_{k}$ (equivalent to $X^{(i+1)}_{k} = VS_k$)
  6. Calculate new $B^{-1}R^{(i+1)}_k$
  7. Set $V=[X^{(i+1)}_k,B^{-1}R^{(i+1)}_k, P^{(i+1)}_k]$, goto 1.

Deflation technique which stops iterating converged eigestates can also be applied here.

The method also converges linearly, but faster than PINVIT.

LOBPCG summary

The next method (Jacobi-Davidson) has smart preconditioning and superlinear convergence (if systems are solved accurately)!

Jacobi-Davidson (JD) method

Jacobi-Davidson method is a very popular technique for solving eigvalue problems (not only symmetric!).

It consits of two key ingredients:

JD derivation

But we will derive it similarly to the original paper.

Jacobi correction equation

Jacobi not only presents the way to solve the eigenvalue problem by Jacobi rotations, but also proposed an iterative procedure. Let $x_j$ be the current approximation, and $t$ the correction:

$$A(x_j + t) = \lambda (x_j + t),$$

and we look for the correction $t \perp x_j$ (new orthogonal vector).

Then, the parallel part has the form

$$x_j x^*_j A (x_j + t) = \lambda x_j,$$

which simplifies to

$$R(x_j) + x^* _j A t = \lambda.$$

The orthogonal component is

$$( I - x_j x^*_j) A (x_j + t) = (I - x_j x^*_j) \lambda (x_j + t),$$

which is equivalent to

$$ (I - x_j x^*_j) (A - \lambda I) t = (I - x_j x^*_j) (- A x_j + \lambda x_j) = - (I - x_j x^*_j) A x_j = - (A - R(x_j) I) x_j = -r_j. $$

$r_j$ is the residual.

Since $(I - x_j x^*_j) t = t$, we can rewrite this equation in the symmetric form

$$ (I - x_j x^*_j) (A - \lambda I) (I - x_j x^*_j) t = -r_j.$$

Now we replace $\lambda$ by $R(x_j)$, and get the Jacobi correction equation:

$$ (I - x_j x^*_j) (A - R(x_j) I) (I - x_j x^*_j) t = -r_j. $$

Since $r_j \perp x_j$ this equation is consistent, if $(A - R(x_j) I)$ is non-singular.

Solving Jacobi correction equation

Typically Jacobi equation is solved inexactly by the appropriate Krylov method.

Even inexact solution of Jacobi equation ensures (why?) that the correction $t$ is orthogonal to $x_j$, which is good for computations.

Connection to the Rayleigh quotient iteration

If this equation is solved exactly, we will get Rayleigh quotient iteration! Let us show that.

$$ (I - x_j x^*_j) (A - R(x_j) I) (I - x_j x^*_j) t = -r_j.$$$$ (I - x_j x^*_j) (A - R(x_j) I) t = -r_j.$$$$ (A - R(x_j) I) t - \alpha x_j = -r_j, \quad \alpha = x^*_j (A - R(x_j) I) x_j$$$$ t = \alpha (A - R(x_j) I)^{-1}x_j - (A - R(x_j) I)^{-1}r_j,$$

Thus, since $(A - R(x_j) I)^{-1}r_j = (A - R(x_j) I)^{-1}(A - R(x_j) I)x_j = x_j$ we get

$$x_{j+1} = x_j + t = \alpha (A - R(x_j) I)^{-1}x_j,$$

which is Rayleigh quotient iteration up to normalization.

Preconditioning of Jacobi equation

A popular preconditioner for solving Jacobi equation by Krylov method has the form

$$ \widetilde K = (I - x_j x^*_j) K (I - x_j x^*_j) $$

where $K$ is easily-inverted approximation of $(A - R(x_j) I)$.

We need to derive how to solve a system with $\widetilde K$ in terms of solving a system with $K$.

We already showed that equation

$$ (I - x_j x^*_j) K (I - x_j x^*_j) \tilde t = f $$

is equavelnt to

$$ \tilde t = \alpha K^{-1}x_j + K^{-1}f $$

The trick now is to forget about the value of $\alpha$ and find it from $\tilde t\perp x_j$ to maintain orthogonality:

$$ \alpha = \frac{x_j^*K^{-1}f}{x_j^* K^{-1}x_j} $$

Thus for each iteration of the Jacobi equation we calculate $K^{-1}x_j$ and then update only $K^{-1}f$ on each internal Krylov iteration

Subspace acceleration in JD

On each iteration of the method we expand a basis with new $t$.

Namely, $V_j = [v_1,\dots,v_{j-1},v_j]$, where $v_j$ is vector $t$ orthogonalized to $V_{j-1}$.

Then standard Rayleigh-Ritz procedure is used.

Historal fact: Initially subspace acceleration was used in the Davidson method.

However, instead of the Jacobi equation, equation $(\mathrm{diag}(A) - R(x_j)I)t = -r_j$ was used.
Davidson method was very popular in quantum chemistry computations.

The block case of JD

If we want many eigenvectors, we just compute partial Schur decomposition:

$$A Q_k = Q_k T_k, $$

and then want to update $Q_k$ by one vector added to $Q_k$. We just use instead of $A$ the matrix $(I - Q_k Q^*_k) A (I - Q_k Q^*_k)$.

Jacobi-Davidson: summary

The correction equation can be solved only roughly, and JD method is often the fastest.

Software

Comparison of Lanczos, LOBPCG and PRIMME

Take-home message about large scale partial eigenvalue problem

Intro to streaming algorithms

Oja's method

$$ \max_{W \in \mathbb{R}^{n \times k}, W^{\top}W = I} \mathrm{trace}(W^{\top}\hat{\Sigma}W), $$

where $\hat{\Sigma}$ is sampled covariance matrix.

Sketching

Next lecture