Lecture 8: Matrix decompositions review. How to compute QR decomposition and Schur decomposition¶

Recap of the previous lecture¶

  • Eigenvectors and eigenvalues
  • Characteristic polynomial and why using it to compute eigenvalues is a bad idea
  • Power method to find leading (maximum absolute value) eigenvalue and eigenvector
  • Gershgorin theorem
  • Schur theorem: $A = U T U^*$
  • Normal matrices: $A^* A = A A^*$
  • Advanced topic: pseudospectrum

Today's lecture¶

  • Today we will talk about matrix factorizations as a general tool

  • Basic matrix factorizations in numerical linear algebra:

    • LU decomposition and Gaussian elimination — already covered
    • QR decomposition and Gram-Schmidt algorithm
    • Schur decomposition and QR-algorithm
    • Methods for computing the SVD in the second part
  • We already introduced the QR decomposition some time ago, but now we are going to discuss it in more detail.

General concept of matrix factorization¶

  • In numerical linear algebra we need to solve different tasks, for example:

    • Solve linear systems $Ax = f$
    • Compute eigenvalues / eigenvectors
    • Compute singular values / singular vectors
    • Compute inverses, even sometimes determinants
    • Compute matrix functions like $\exp(A), \cos(A)$ (these are not elementwise functions)
  • In order to do this, we represent the matrix as a sum and/or product of matrices with simpler structure, such that the tasks above can be solved faster and/or in a more numerically stable way.

  • What is a simpler structure?

What is a simpler structure¶

  • We already encountered several classes of matrices with structure.

  • For dense matrices the most important classes are

    • unitary matrices
    • upper/lower triangular matrices
    • diagonal matrices

Other classes of structured matrices¶

  • For sparse matrices, sparsity constraints are often incorporated into the factorizations (the factors are required to stay sparse).
  • For Toeplitz matrices an important class is the class of matrices with low displacement rank, which builds on low-rank matrices.
  • The class of low-rank matrices and block low-rank matrices appears in many applications.

Plan¶

The plan for today's lecture is to discuss the decompositions one-by-one and point out:

  • How to compute a particular decomposition
  • When the decomposition exists
  • What is done in real life (LAPACK).

Decompositions we want to discuss today¶

  • LU factorization & Cholesky factorization — quick reminder, already done.
  • QR decomposition and Gram-Schmidt algorithm
  • One slide about the SVD (more details in the second part of today's lecture)

PLU decomposition¶

  • Any nonsingular matrix can be factored as
$$ A = P L U, $$

where $P$ is a permutation matrix, $L$ is a lower triangular matrix, $U$ is an upper triangular.

  • Main goal of the LU decomposition is to solve linear systems, because (taking $P = I$ for simplicity)
$$ A^{-1} f = (L U)^{-1} f = U^{-1} L^{-1} f, $$

and this reduces to the solution of two linear systems

$$ L y = f, \quad U x = y $$

with lower and upper triangular matrices respectively.

Q: what is the complexity of solving these linear systems?
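For illustration, here is a minimal sketch of the two triangular solves using `jax.scipy.linalg.solve_triangular`; the factors $L$, $U$ below are made-up toy matrices rather than an actual LU factorization:

import jax
import jax.numpy as jnp
from jax.scipy.linalg import solve_triangular

jax.config.update("jax_enable_x64", True)

# Toy L, U factors (assumed to play the role of an LU factorization with P = I)
L = jnp.array([[1.0, 0.0, 0.0],
               [0.5, 1.0, 0.0],
               [0.25, 0.5, 1.0]])
U = jnp.array([[4.0, 2.0, 1.0],
               [0.0, 3.0, 1.0],
               [0.0, 0.0, 2.0]])
f = jnp.array([1.0, 2.0, 3.0])

# Forward substitution with L, then back substitution with U
y = solve_triangular(L, f, lower=True)
x = solve_triangular(U, y, lower=False)
print(jnp.linalg.norm(L @ U @ x - f))  # residual should be at machine precision level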

Positive definite matrices and Cholesky decomposition, reminder¶

If the matrix is Hermitian positive definite, i.e.

$$ A = A^*, \quad (Ax, x) > 0, \quad x \ne 0, $$

then it can be factored as

$$ A = RR^*, $$

where $R$ is lower triangular.

We will need this for the QR decomposition.
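A quick numerical check with `jnp.linalg.cholesky` (the matrix below is just a small arbitrary Hermitian positive definite example):

import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

# A small Hermitian positive definite matrix: B B^T + I is always such a matrix
B = jnp.array([[2.0, 1.0, 0.0],
               [1.0, 3.0, 1.0],
               [0.0, 1.0, 2.0]])
A = B @ B.T + jnp.eye(3)
R = jnp.linalg.cholesky(A)           # lower triangular factor R with A = R R^*
print(jnp.linalg.norm(R @ R.T - A))  # reconstruction error, ~machine precision
print(jnp.tril(R) - R)               # zero matrix: R is indeed lower triangular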

QR decomposition¶

  • The next decomposition: QR decomposition.
  • Again from the name it is clear that a matrix is represented as a product
$$ A = Q R, $$

where $Q$ is a matrix with orthonormal columns ($Q^* Q = I$; unitary when square) and $R$ is upper triangular.

  • The matrix sizes: $Q$ is $n \times m$, $R$ is $m \times m$ if $n\geq m$. See our poster for a visualization of the QR decomposition.

  • QR decomposition is defined for any rectangular matrix.

QR decomposition: applications¶

This decomposition plays a crucial role in many problems:

  • Computing orthogonal bases in a linear space
  • Used in the preprocessing step for the SVD
  • QR-algorithm for the computation of eigenvectors and eigenvalues (one of the 10 most important algorithms of the 20th century) is based on the QR decomposition
  • Solving overdetermined systems of linear equations (linear least-squares problem)

QR decomposition and least squares, reminder¶

  • Suppose we need to solve
$$ \Vert A x - f \Vert_2 \rightarrow \min_x, $$

where $A$ is $n \times m$, $n \geq m$.

  • Then we factorize
$$ A = Q R, $$

and use the formula for the pseudoinverse of a matrix $A$ with full column rank:

$$ x = A^{\dagger}f = (A^*A)^{-1}A^*f = ((QR)^*(QR))^{-1}(QR)^*f = (R^*Q^*QR)^{-1}R^*Q^*f = R^{-1}Q^*f, $$

thus $x$ can be recovered from

$$R x = Q^*f.$$
  • Note that this is a square system of linear equations with an upper triangular matrix. What is the complexity of solving this system?
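A short numerical check of this recipe (thin QR, then one triangular solve) on a random overdetermined system; `jnp.linalg.lstsq` is used only as a reference:

import jax
import jax.numpy as jnp
from jax.scipy.linalg import solve_triangular

jax.config.update("jax_enable_x64", True)

# A random overdetermined system (sizes chosen arbitrarily for the illustration)
n, m = 40, 9
A = jax.random.normal(jax.random.PRNGKey(0), (n, m))
f = jax.random.normal(jax.random.PRNGKey(1), (n,))

Q, R = jnp.linalg.qr(A)                           # thin QR: Q is n x m, R is m x m
x_qr = solve_triangular(R, Q.T @ f, lower=False)  # solve R x = Q^* f
x_ref, *_ = jnp.linalg.lstsq(A, f)                # reference least-squares solution
print(jnp.linalg.norm(x_qr - x_ref) / jnp.linalg.norm(x_ref))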

Existence of QR decomposition¶

Theorem. Every rectangular $n \times m$ matrix has a QR decomposition.

There are several ways to prove it and compute it:

  • Theoretical: using the Gram matrices and Cholesky factorization
  • Geometrical: using the Gram-Schmidt orthogonalization
  • Practical: using Householder/Givens transformations

Proof using Cholesky decomposition¶

If we have the representation of the form

$$A = QR,$$

then $A^* A = ( Q R)^* (QR) = R^* (Q^* Q) R = R^* R$; the matrix $A^* A$ is called the Gram matrix, and its elements are scalar products of the columns of $A$.

Proof using Cholesky decomposition (full column rank)¶

  • Assume that $A$ has full column rank. Then, it is simple to show that $A^* A$ is positive definite:
$$ (A^* A y, y) = (Ay, Ay) = \Vert Ay \Vert^2 > 0, \quad y\not = 0. $$
  • Therefore, the Cholesky factorization $A^* A = R^* R$ always exists.

  • Then the matrix $Q = A R^{-1}$ has orthonormal columns (it is unitary when $A$ is square):

$$ (A R^{-1})^* (AR^{-1})= R^{-*} A^* A R^{-1} = R^{-*} R^* R R^{-1} = I. $$

Proof using Cholesky decomposition (rank-deficient case)¶

  • When an $n \times m$ matrix does not have full column rank, it is said to be rank-deficient.

  • The QR decomposition, however, also exists.

  • For any rank-deficient matrix there is a sequence of full-column rank matrices $A_k$ such that $A_k \rightarrow A$ (why?).

  • Each $A_k$ can be decomposed as $A_k = Q_k R_k$.

  • The set of matrices with orthonormal columns is compact, thus there exists a converging subsequence $Q_{n_k} \rightarrow Q$ (why?), and $R_{n_k} = Q_{n_k}^* A_{n_k} \rightarrow Q^* A = R$, which is upper triangular as a limit of upper triangular matrices.

Stability of QR decomposition via Cholesky decomposition¶

  • So, the simplest way to compute the QR decomposition is:
$$A^* A = R^* R,$$

and

$$Q = A R^{-1}.$$
  • It is a bad idea for numerical stability. Let us run a short demo (on a Hilbert-type matrix).
In [6]:
import jax
import jax.numpy as jnp
jax.config.update("jax_enable_x64", True)  # enable double precision

n = 40
r = 9
# An ill-conditioned Hilbert-type matrix
a = jnp.array([[1.0 / (i + j + 0.5) for i in range(r)] for j in range(n)])
q, Rmat = jnp.linalg.qr(a)
e = jnp.eye(r)
print('Built-in QR orth', jnp.linalg.norm(jnp.dot(q.T, q) - e))

# QR via Cholesky of the Gram matrix: A^* A = R^* R, Q = A R^{-1}
gram_matrix = a.T.dot(a)
Rmat1 = jnp.linalg.cholesky(gram_matrix)   # lower triangular Cholesky factor
q1 = jnp.linalg.solve(Rmat1, a.T).T        # Q = A (Rmat1^T)^{-1}
print('Via Cholesky:', jnp.linalg.norm(jnp.dot(q1.T, q1) - e))
Built-in QR orth 1.7302149860736299e-15
Via Cholesky: 0.9172162385797529

Second way: Gram-Schmidt orthogonalization¶

  • QR decomposition is a mathematical way of writing down the Gram-Schmidt orthogonalization process.
  • Given a sequence of vectors $a_1, \ldots, a_m$ we want to find an orthonormal basis $q_1, \ldots, q_m$ such that every $a_i$ is a linear combination of $q_1, \ldots, q_i$.

Gram-Schmidt:

  1. $q_1 := a_1/\Vert a_1 \Vert$
  2. $q_2 := a_2 - (a_2, q_1) q_1, \quad q_2 := q_2/\Vert q_2 \Vert$
  3. $q_3 := a_3 - (a_3, q_1) q_1 - (a_3, q_2) q_2, \quad q_3 := q_3/\Vert q_3 \Vert$
  4. And so on

Note that the transformation from $Q$ to $A$ has triangular structure, since from the $k$-th vector we subtract only the previous ones; hence each $a_k$ is a linear combination of $q_1, \ldots, q_k$, i.e. $A = QR$ with an upper triangular $R$ (recall that the inverse of a triangular matrix is again triangular). A minimal implementation is sketched below.
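A minimal sketch of classical Gram-Schmidt in the notation above (column $k$ of $A$ is orthogonalized against the previously computed $q_j$); this is illustrative code, not a production algorithm:

import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt: Q with orthonormal columns, upper triangular R, A = Q R."""
    n, m = A.shape
    Q = jnp.zeros((n, m))
    R = jnp.zeros((m, m))
    for k in range(m):
        q = A[:, k]
        for j in range(k):
            # Project the original column a_k onto the already computed q_j
            R = R.at[j, k].set(Q[:, j] @ A[:, k])
            q = q - R[j, k] * Q[:, j]
        R = R.at[k, k].set(jnp.linalg.norm(q))
        Q = Q.at[:, k].set(q / R[k, k])
    return Q, R

A = jnp.array([[1.0 / (i + j + 0.5) for i in range(4)] for j in range(6)])
Q, R = gram_schmidt_qr(A)
print(jnp.linalg.norm(Q @ R - A))             # reconstruction error
print(jnp.linalg.norm(Q.T @ Q - jnp.eye(4)))  # orthogonality of Q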

Modified Gram-Schmidt¶

  • Gram-Schmidt can be very unstable (i.e., the produced vectors will not be orthogonal, especially if $q_k$ has small norm).
    This is called loss of orthogonality.

  • There is a remedy, called modified Gram-Schmidt method. Instead of doing

$$q_k := a_k - (a_k, q_1) q_1 - \ldots - (a_k, q_{k-1}) q_{k-1}$$

we do it step-by-step. First we set $q_k := a_k$ and orthogonalize sequentially:

$$ q_k := q_k - (q_k, q_1)q_1, \quad q_k := q_{k} - (q_k,q_2)q_2, \ldots $$
  • In exact arithmetic, it is the same. In floating point arithmetic, the behaviour is completely different!

  • Note that the complexity is $\mathcal{O}(nm^2)$ operations; a comparison of the two variants is sketched below.
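A sketch of the modified version, reusing `gram_schmidt_qr` from the sketch above; on an ill-conditioned matrix the classical variant typically shows a much larger loss of orthogonality:

def modified_gram_schmidt_qr(A):
    """Modified Gram-Schmidt: same arithmetic in exact precision, much more stable in floating point."""
    n, m = A.shape
    Q = jnp.zeros((n, m))
    R = jnp.zeros((m, m))
    for k in range(m):
        q = A[:, k]
        for j in range(k):
            # Key difference: project the *current*, partially orthogonalized q, not the original a_k
            R = R.at[j, k].set(Q[:, j] @ q)
            q = q - R[j, k] * Q[:, j]
        R = R.at[k, k].set(jnp.linalg.norm(q))
        Q = Q.at[:, k].set(q / R[k, k])
    return Q, R

# Loss of orthogonality on the ill-conditioned Hilbert-type matrix from the demo above
A = jnp.array([[1.0 / (i + j + 0.5) for i in range(9)] for j in range(40)])
e = jnp.eye(9)
Q_cgs, _ = gram_schmidt_qr(A)            # classical GS from the sketch above
Q_mgs, _ = modified_gram_schmidt_qr(A)
print('Classical GS:', jnp.linalg.norm(Q_cgs.T @ Q_cgs - e))
print('Modified  GS:', jnp.linalg.norm(Q_mgs.T @ Q_mgs - e))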

QR decomposition: the (almost) practical way¶

  • If $A = QR$, then
$$ R = Q^* A, $$

and we need to find a certain orthogonal matrix $Q$ that brings a matrix into upper triangular form.

  • For simplicity, we will look for an $n \times n$ unitary matrix such that
$$ Q^* A = \begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & 0 & * \\ & 0_{(n-m) \times m} \end{bmatrix} $$
  • We will do it column-by-column.
  • First, we find a Householder matrix $H_1 = I - 2 uu^{*}$, $\Vert u \Vert_2 = 1$, such that (we illustrate on a $4 \times 3$ matrix)
$$ H_1 A = \begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & * & * \\ 0 & * & * \end{bmatrix} $$

Then,

$$ H_2 H_1 A = \begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & 0 & * \\ 0 & 0 & * \end{bmatrix}, $$

where

$$ H_2 = \begin{bmatrix} 1 & 0 \\ 0 & H'_2 \end{bmatrix} $$

and $H'_2$ is a $3 \times 3$ Householder matrix.

Finally,

$$ H_3 H_2 H_1 A = \begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & 0 & * \\ 0 & 0 & 0 \end{bmatrix}. $$

You can try to implement it yourself; a minimal sketch is given below.
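A minimal unblocked sketch of Householder QR (real case, no special handling of zero columns), just to make the procedure concrete:

import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

def householder_qr(A):
    """Unblocked Householder QR for a real n x m matrix (n >= m): returns Q (n x n), R (n x m)."""
    n, m = A.shape
    R = jnp.asarray(A, dtype=jnp.float64)
    Q = jnp.eye(n)
    for k in range(min(n - 1, m)):
        x = R[k:, k]
        e1 = jnp.zeros_like(x).at[0].set(1.0)
        # Sign choice avoids cancellation when forming x - alpha * e1
        alpha = -jnp.where(x[0] >= 0, 1.0, -1.0) * jnp.linalg.norm(x)
        u = x - alpha * e1
        u = u / jnp.linalg.norm(u)
        # Apply H_k = I - 2 u u^T to the trailing block of R, accumulate Q = H_1 H_2 ... H_m
        R = R.at[k:, :].set(R[k:, :] - 2.0 * jnp.outer(u, u @ R[k:, :]))
        Q = Q.at[:, k:].set(Q[:, k:] - 2.0 * jnp.outer(Q[:, k:] @ u, u))
    return Q, R

A = jnp.array([[1.0 / (i + j + 0.5) for i in range(3)] for j in range(4)])
Q, R = householder_qr(A)
print(jnp.linalg.norm(Q @ R - A))             # A = Q R
print(jnp.linalg.norm(Q.T @ Q - jnp.eye(4)))  # Q is orthogonal
print(jnp.linalg.norm(jnp.tril(R, -1)))       # below-diagonal part of R is zero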

QR decomposition: real life¶

  • In reality, since this is a dense matrix factorization, you should implement the algorithm in terms of blocks (why?).

  • Instead of using a single Householder transformation, we use a block Householder transformation of the form

$$H = I - 2UU^*, $$

where $U^* U = I$.

  • This allows us to use BLAS-3 operations.

Rank-revealing QR-decomposition¶

  • The QR decomposition can also be used to compute the (numerical) rank of a matrix, see Rank-Revealing QR Factorizations and the Singular Value Decomposition, Y. P. Hong; C.-T. Pan

  • It is done via so-called rank-revealing factorization.

  • It is based on the representation

$$A P = Q R,$$

where $P$ is a permutation matrix (it permutes the columns of $A$), and $R$ has the block form

$$R = \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22}\end{bmatrix}.$$
  • The goal is to find $P$ such that the norm of $R_{22}$ is small, so you can find the numerical rank by looking at it.

  • An estimate is $\sigma_{r+1} \leq \Vert R_{22} \Vert_2$ (check why).
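`jnp.linalg.qr` does not expose column pivoting, so here is a short illustration with SciPy's `scipy.linalg.qr(..., pivoting=True)` (QR with column pivoting, the common practical form of a rank-revealing QR); the test matrix is an arbitrary example of numerical rank 10:

import numpy as np
import scipy.linalg

rng = np.random.default_rng(0)
n, m, true_rank = 100, 60, 10
# A matrix of numerical rank 10: exact rank-10 part plus tiny noise
A = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, m))
A = A + 1e-10 * rng.standard_normal((n, m))

Q, R, piv = scipy.linalg.qr(A, mode='economic', pivoting=True)  # A[:, piv] = Q @ R
print(np.abs(np.diag(R))[:14])                       # sharp drop after the first 10 values
print(np.linalg.norm(R[true_rank:, true_rank:], 2))  # || R_22 ||_2 is small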

Summary¶

  • LU and QR decompositions can be computed using direct methods in a finite number of operations.

  • What about Schur form and SVD?

  • They cannot be computed by direct methods (why?); they can only be computed by iterative methods.

  • Nevertheless, these iterative methods still have $\mathcal{O}(n^3)$ complexity in floating point arithmetic, thanks to fast convergence.

Schur form¶

  • Recall that every matrix can be written in the Schur form
$$A = Q T Q^*,$$

with upper triangular $T$ and unitary $Q$; this decomposition gives the eigenvalues of the matrix (they are on the diagonal of $T$), as the short check below illustrates.

  • The first and the main algorithm for computing the Schur form is the QR algorithm.
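For reference, a quick numerical check of the Schur form with SciPy's `scipy.linalg.schur` (random test matrix, complex Schur form requested so that $T$ is truly triangular):

import numpy as np
import scipy.linalg

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
T, Z = scipy.linalg.schur(A, output='complex')  # A = Z T Z^*, T upper triangular
print(np.linalg.norm(Z @ T @ Z.conj().T - A))   # reconstruction error
print(np.sort_complex(np.diag(T)))              # eigenvalues read off the Schur form
print(np.sort_complex(np.linalg.eigvals(A)))    # eigenvalues computed directly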

QR algorithm¶

  • The QR algorithm was independently proposed in 1961 by Kublanovskaya and Francis.

  • Do not **confuse** the QR algorithm and the QR decomposition!

  • QR decomposition is a representation of a matrix, whereas the QR algorithm uses the QR decomposition to compute the eigenvalues!

Way to QR algorithm¶

  • Consider the equation
$$A = Q T Q^*,$$

and rewrite it in the form

$$ Q T = A Q. $$
  • On the left we see the QR factorization of the matrix $AQ$.

  • We can use this to derive a fixed-point iteration for the Schur form, also known as the QR algorithm.

Derivation of the QR algorithm as fixed-point iteration¶

We can write down the iterative process

$$ Q_{k+1} R_{k+1} = A Q_k, \quad Q_{k+1}^* A = R_{k+1} Q^*_k $$

Introduce

$$A_k = Q^* _k A Q_k = Q^*_k Q_{k+1} R_{k+1} = \widehat{Q}_k R_{k+1}$$

and the new approximation reads

$$A_{k+1} = Q^*_{k+1} A Q_{k+1} = R_{k+1} Q^*_k Q_{k+1} = R_{k+1} \widehat{Q}_k,$$ where we used $Q_{k+1}^* A = R_{k+1} Q^*_k$.

So we arrive at the standard form of the QR algorithm.

The final formulas are then written in the classical QRRQ-form:

  1. Start from $A_0 = A$.
  2. Compute QR factorization of $A_k = Q_k R_k$.
  3. Set $A_{k+1} = R_k Q_k$.

Iterate until $A_k$ is triangular enough (e.g. norm of subdiagonal part is small enough).

What about the convergence and complexity¶

Statement

Matrices $A_k$ are unitarily similar to $A$:

$$A_k = Q^*_{k-1} A_{k-1} Q_{k-1} = (Q_0 Q_1 \ldots Q_{k-1})^* A (Q_0 Q_1 \ldots Q_{k-1}),$$

and the product of unitary matrices is a unitary matrix.

  • Complexity of each step is $\mathcal{O}(n^3)$, if a general QR decomposition is done.

  • Our hope is that $A_k$ will be very close to a triangular matrix for sufficiently large $k$.

In [10]:
import jax.numpy as jnp

n = 5
a = jnp.array([[1.0 / (i - j + 0.5) for i in range(n)] for j in range(n)])
niters = 1000
# Unshifted QR algorithm: A_{k+1} = R_k Q_k
for k in range(niters):
    q, rmat = jnp.linalg.qr(a)
    a = rmat.dot(q)
print('Leading 4x4 block of a:')
print(a[:4, :4])
Leading 4x4 block of a:
[[ 1.68842566e+00  2.39874769e+00 -4.91278296e-01  4.55588011e-01]
 [-2.27337624e+00  1.13938512e+00 -1.02951549e+00  5.70268353e-01]
 [-1.39868453e-15 -3.77840299e-15  2.37335976e+00  1.79215981e+00]
 [-4.11289213e-16 -5.73344858e-16 -9.26030829e-01  2.20080588e+00]]

Convergence and complexity of the QR algorithm¶

  • The convergence of the QR algorithm is from the largest eigenvalues to the smallest.

  • At least 2-3 iterations are needed per eigenvalue.

  • Each step is one QR factorization and one matrix-by-matrix product, as a result $\mathcal{O}(n^3)$ complexity.

Q: does it mean $\mathcal{O}(n^4)$ complexity totally?

A: fortunately, not.

  • We can speed up the QR algorithm by using shifts, since $A_k - \lambda I$ has the same Schur vectors.

  • We will discuss these details later; a small illustration of the effect of shifting is given below.
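A minimal sketch of a single-shift QR sweep with the simple shift $s_k = (A_k)_{nn}$, applied to a symmetric test matrix (chosen here just for illustration; practical shift strategies and deflation come later):

import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

def qr_sweeps(A, niters, shift=True):
    """QR algorithm with the simple shift s_k = (A_k)[n-1, n-1]; no deflation, just a sketch."""
    n = A.shape[0]
    Ak = A
    for _ in range(niters):
        s = Ak[-1, -1] if shift else 0.0
        Q, R = jnp.linalg.qr(Ak - s * jnp.eye(n))
        Ak = R @ Q + s * jnp.eye(n)
    return Ak

# Symmetric test matrix (real eigenvalues): the 1D finite-difference Laplacian
A = 2.0 * jnp.eye(5) - jnp.diag(jnp.ones(4), 1) - jnp.diag(jnp.ones(4), -1)
print('last subdiagonal entry, no shift  :', abs(qr_sweeps(A, 5, shift=False)[-1, -2]))
print('last subdiagonal entry, with shift:', abs(qr_sweeps(A, 5, shift=True)[-1, -2]))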

Convergence of QR algorithm¶

Does the QR algorithm always converge?

Provide an example.

Counterexample¶

For a matrix $A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$

we have $A_k = A$.
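A quick numerical check that the iteration stagnates (one QR sweep maps $A$ back to itself):

import jax.numpy as jnp

a = jnp.array([[0.0, 1.0],
               [1.0, 0.0]])
q, r = jnp.linalg.qr(a)
print(r @ q)  # one QR sweep returns the same matrix: no convergence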

Connection to orthogonal iteration¶

In the previous lecture, we considered the power method, which uses $A^k v$ as an approximation to the leading eigenvector.

The QR algorithm implicitly computes the QR factorization of the matrix $A^k$:

$$A^k = A \cdot \ldots \cdot A = Q_1 R_1 Q_1 R_1 \ldots = Q_1 Q_2 R_2 Q_2 R_2 \ldots (R_2 R_1) = \ldots (Q_1 Q_2 \ldots Q_k) (R_k \ldots R_1).$$
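A short numerical check of this identity: accumulate the $Q_k$ and $R_k$ factors during the iteration and compare with $A^k$:

import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

n, k = 5, 5
a0 = jnp.array([[1.0 / (i - j + 0.5) for i in range(n)] for j in range(n)])

ak = a0
q_acc = jnp.eye(n)
r_acc = jnp.eye(n)
for _ in range(k):
    q, r = jnp.linalg.qr(ak)
    ak = r @ q            # next iterate of the QR algorithm
    q_acc = q_acc @ q     # Q_1 Q_2 ... Q_k
    r_acc = r @ r_acc     # R_k ... R_1

# (Q_1 ... Q_k)(R_k ... R_1) is a QR factorization of A^k
print(jnp.linalg.norm(q_acc @ r_acc - jnp.linalg.matrix_power(a0, k)))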

A few words about the SVD¶

  • Last but not least: the singular value decomposition of matrix.
$$A = U \Sigma V^*.$$
  • We can compute it via the eigendecomposition of
$$A^* A = V \Sigma^2 V^*,$$

and/or

$$AA^* = U \Sigma^2 U^*$$

with the QR algorithm, but it is a bad idea (cf. the Gram matrix example above); see the short demo below.

  • We will discuss methods for computing SVD later.
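A small illustration of why it is a bad idea: on the ill-conditioned Hilbert-type matrix used earlier, the smallest singular values computed from the Gram matrix are inaccurate (roughly speaking, anything below $\sqrt{\varepsilon_{\mathrm{mach}}}\,\sigma_1$ is lost):

import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

# The ill-conditioned Hilbert-type matrix used earlier in this lecture
n, r = 40, 9
a = jnp.array([[1.0 / (i + j + 0.5) for i in range(r)] for j in range(n)])

s_svd = jnp.linalg.svd(a, compute_uv=False)  # reference singular values
ev = jnp.linalg.eigvalsh(a.T @ a)[::-1]      # eigenvalues of the Gram matrix, descending
s_gram = jnp.sqrt(jnp.maximum(ev, 0.0))      # tiny eigenvalues may even come out negative

print('smallest singular values (SVD) :', s_svd[-3:])
print('smallest singular values (Gram):', s_gram[-3:])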

Bonus: CS (Cosine-Sine) decomposition¶

  • Let $Q$ be a square unitary matrix with an even number of rows, split into four equal-size blocks
$$ Q = \begin{bmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{bmatrix} $$

  • Then there exist unitary matrices $U_1, U_2, V_1, V_2$ such that
$$ \begin{bmatrix} U_1^* & 0 \\ 0 & U_2^* \end{bmatrix} \begin{bmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{bmatrix} \begin{bmatrix} V_1 & 0 \\ 0 & V_2 \end{bmatrix} = \begin{bmatrix} C & S \\ -S & C \end{bmatrix}, $$

where $C = \mathrm{diag}(c)$ and $S = \mathrm{diag}(s)$ such that $c_i \geq 0, s_i \geq 0$ and $c_i^2 + s_i^2 = 1$

  • Q: how many SVDs are hidden inside the CS decomposition?

  • In the case of a rectangular matrix with orthonormal columns the decomposition takes the form

$$ \begin{bmatrix} U_1^* & 0 \\ 0 & U_2^* \end{bmatrix} \begin{bmatrix} Q_{1} \\ Q_{2} \end{bmatrix} V = \begin{bmatrix} C \\ S \end{bmatrix} $$
  • The algorithm for computing this decomposition is presented here

  • This decomposition naturally arises in the problem of finding distances and angles between subspaces
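SciPy provides `scipy.linalg.cossin` (SciPy >= 1.5) for computing this decomposition numerically; a quick check on a random orthogonal matrix (note that SciPy's sign convention for the $S$ block may differ from the formula above):

import numpy as np
import scipy.linalg

rng = np.random.default_rng(0)
m = 3
# A random 2m x 2m orthogonal matrix from the QR factorization of a Gaussian matrix
Q, _ = np.linalg.qr(rng.standard_normal((2 * m, 2 * m)))

# CS decomposition with respect to the 2x2 block partition into m x m blocks
u, cs, vdh = scipy.linalg.cossin(Q, p=m, q=m)
print(np.linalg.norm(u @ cs @ vdh - Q))  # reconstruction error
print(np.round(cs, 3))                   # cosine-sine structure of the middle factor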

Summary¶

  • QR decomposition and Gram-Schmidt algorithm, reduction to a simpler form by Householder transformations
  • Schur decomposition and QR algorithm

Next steps¶

  • Efficient implementation of QR algorithm and its convergence
  • Efficient computation of the SVD: 4 algorithms
  • More applications of the SVD

Questions?¶