Lecture 7: Matrix decompositions and how we compute them

Recap of the previous lecture

  • Eigenvectors and eigenvalues
  • Characteristic polynomial and why using it to compute eigenvalues is a bad idea
  • Power method: $x := Ax$
  • Gershgorin theorem
  • Schur theorem: $A = U T U^*$
  • Normal matrices: $A^* A = A A^*$
  • Advanced topic: pseudospectrum

Today's lecture

  • Today we will talk about matrix factorizations as a general tool

  • The basic matrix factorizations in numerical linear algebra:

    • LU decomposition and Gaussian elimination — already covered
    • QR decomposition and Gram-Schmidt algorithm
    • Schur decomposition and QR-algorithm
    • Methods for computing the SVD (in the second part of the lecture)
  • We already introduced the QR decomposition some time ago, but now we are going to discuss it in more detail.

General concept of matrix factorization

  • In numerical linear algebra we need to solve different tasks, for example:

    • Solve linear systems $Ax = f$
    • Compute eigenvalues / eigenvectors
    • Compute singular values / singular vectors
    • Compute inverses, and sometimes even determinants
    • Compute matrix functions like $\exp(A), \cos(A)$ (these are not elementwise functions)
  • In order to do this, we represent the matrix as a sum and/or a product of matrices with simpler structure, such that the tasks above can be solved faster and/or in a more stable way.

  • What is a simpler structure?

What is a simpler structure

  • We already encountered several classes of matrices with structure.

  • For dense matrices the most important classes are

    • unitary matrices
    • upper/lower triangular matrices
    • diagonal matrices

Other classes of structured matrices

  • For sparse matrices, sparsity constraints are often included in the factorizations.
  • For Toeplitz matrices an important class is the class of matrices with low displacement rank, which builds on low-rank matrices.
  • The class of low-rank matrices and block low-rank matrices is everywhere.

Plan

The plan for today's lecture is to discuss the decompositions one-by-one and point out:

  • How to compute a particular decomposition
  • When the decomposition exists
  • What is done in real life (LAPACK).

Decompositions we want to discuss today

  • LU factorization & Cholesky factorization — quick reminder, already done.
  • QR decomposition and Gram-Schmidt algorithm
  • One slide about the SVD (more details in the second part of today's lecture)

PLU decomposition

  • Any nonsingular matrix can be factored as
$$ A = P L U, $$

where $P$ is a permutation matrix, $L$ is a lower triangular matrix, and $U$ is an upper triangular matrix.

  • The main goal of the LU decomposition is to solve linear systems, because (taking $P = I$ for simplicity)
$$ A^{-1} f = (L U)^{-1} f = U^{-1} L^{-1} f, $$

and this reduces to the solution of two linear systems

$$ L y = f, \quad U x = y $$

with lower and upper triangular matrices respectively.
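
A minimal sketch of this pipeline, assuming SciPy is available: scipy.linalg.lu_factor computes the PLU factorization and scipy.linalg.lu_solve performs the two triangular solves; the test matrix is random and chosen just for illustration.

In [ ]:
import numpy as np
from scipy.linalg import lu_factor, lu_solve

np.random.seed(0)
n = 5
A = np.random.randn(n, n) + n * np.eye(n)   # a well-conditioned test matrix
f = np.random.randn(n)

lu, piv = lu_factor(A)          # PLU factorization, O(n^3)
x = lu_solve((lu, piv), f)      # two triangular solves, O(n^2)
print(np.linalg.norm(A @ x - f))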

Positive definite matrices and Cholesky decomposition, reminder

If the matrix is Hermitian positive definite, i.e.

$$ A = A^*, \quad (Ax, x) > 0, \quad x \ne 0, $$

then it can be factored as

$$ A = RR^*, $$

where $R$ is lower triangular.

We will need this for the QR decomposition.

QR decomposition

  • The next decomposition: QR decomposition.
  • Again from the name it is clear that a matrix is represented as a product
$$ A = Q R, $$

where $Q$ is a matrix with orthonormal columns (unitary if square) and $R$ is upper triangular.

  • The matrix sizes: $Q$ is $n \times m$ and $R$ is $m \times m$ if $n\geq m$. See our poster for a visualization of the QR decomposition.

  • QR decomposition is defined for any rectangular matrix.

QR decomposition: applications

This decomposition plays a crucial role in many problems:

  • Computing orthogonal bases in a linear space
  • Used in the preprocessing step for the SVD
  • QR-algorithm for the computation of eigenvectors and eigenvalues (one of the 10 most important algorithms of the 20th century) is based on the QR decomposition
  • Solving overdetermined systems of linear equations

QR decomposition and least squares, reminder

  • Suppose we need to solve
$$ \Vert A x - f \Vert_2 \rightarrow \min_x, $$

where $A$ is $n \times m$, $n \geq m$.

  • Then we factorize
$$ A = Q R, $$

and use the formula for the pseudo-inverse in the case when $A$ has full column rank:

$$ x = A^{\dagger}f = (A^*A)^{-1}A^*f = ((QR)^*(QR))^{-1}(QR)^*f = (R^*Q^*QR)^{-1}R^*Q^*f = R^{-1}Q^*f, $$

thus $x$ can be recovered from

$$R x = Q^*f.$$
  • Note that this is a square system of linear equations with an upper triangular matrix. What is the complexity of solving this system?
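
  • A minimal NumPy/SciPy sketch of this least-squares pipeline (the final triangular solve costs $\mathcal{O}(m^2)$ operations); the random test data are chosen just for illustration:

In [ ]:
import numpy as np
from scipy.linalg import solve_triangular

np.random.seed(0)
n, m = 1000, 5
A = np.random.randn(n, m)
f = np.random.randn(n)

Q, R = np.linalg.qr(A)                    # thin QR: Q is n x m, R is m x m
x = solve_triangular(R, Q.T @ f)          # back substitution with upper triangular R
print(np.linalg.norm(x - np.linalg.lstsq(A, f, rcond=None)[0]))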

Short side note: randomized least squares

  • One of the efficient ways to solve strongly overdetermined ($n\gg m$) systems of linear equations is the Kaczmarz method.

  • Instead of solving all equations, pick one randomly, which reads

$$a^{\top}_i x = f_i,$$

and given an approximation $x_k$ try to find $x_{k+1}$ as

$$x_{k+1} = \arg \min_x \Vert x - x_k \Vert, \quad \mbox{s.t.} \quad a^{\top}_i x = f_i.$$
  • A simple analysis gives
$$x_{k+1} = x_k - \frac{(a_i, x_k) - f_i}{(a_i, a_i)} a_i. $$
  • A cheap update, but the analysis is quite complicated.
  • You can recognize this method as stochastic gradient descent with a specific step size equal to $\frac{1}{\|a_i\|_2^2}$ for every sample (see the sketch below).
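
  • A small sketch of the Kaczmarz iteration on a consistent random overdetermined system; the matrix, right-hand side and number of iterations are assumptions chosen only for illustration:

In [ ]:
import numpy as np

np.random.seed(0)
n, m = 2000, 20
A = np.random.randn(n, m)
x_true = np.random.randn(m)
f = A @ x_true                                   # consistent system with exact solution x_true

x = np.zeros(m)
for k in range(5000):
    i = np.random.randint(n)                     # pick a random equation
    a_i = A[i]
    x = x - (a_i @ x - f[i]) / (a_i @ a_i) * a_i # project x onto the i-th hyperplane
print(np.linalg.norm(x - x_true))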

Existence of QR decomposition

Theorem. Every rectangular $n \times m$ matrix has a QR decomposition.

There are several ways to prove it and compute it:

  • Theoretical: using the Gram matrices and Cholesky factorization
  • Geometrical: using the Gram-Schmidt orthogonalization
  • Practical: using Householder/Givens transformations

Proof using Cholesky decomposition

If we have the representation of the form

$$A = QR,$$

then $A^* A = ( Q R)^* (QR) = R^* (Q^* Q) R = R^* R$. The matrix $A^* A$ is called the Gram matrix, and its elements are the scalar products of the columns of $A$.

Proof using Cholesky decomposition (full column rank)

  • Assume that $A$ has full column rank. Then, it is simple to show that $A^* A$ is positive definite:
$$ (A^* A y, y) = (Ay, Ay) = \Vert Ay \Vert^2 > 0, \quad y\not = 0. $$
  • Therefore, the Cholesky factorization $A^* A = R^* R$ exists, with $R$ upper triangular.

  • Then the matrix $A R^{-1}$ has orthonormal columns (it is unitary if $A$ is square):

$$ (A R^{-1})^* (AR^{-1})= R^{-*} A^* A R^{-1} = R^{-*} R^* R R^{-1} = I. $$

Proof using Cholesky decomposition (rank-deficient case)

  • When an $n \times m$ matrix does not have full column rank, it is said to be rank-deficient.

  • The QR decomposition, however, also exists.

  • For any rank-deficient matrix there is a sequence of full-column rank matrices $A_k$ such that $A_k \rightarrow A$ (why?).

  • Each $A_k$ can be decomposed as $A_k = Q_k R_k$.

  • The set of all unitary matrices is compact, thus there exists a converging subsequence $Q_{n_k} \rightarrow Q$ (why?), and then $R_{n_k} = Q^*_{n_k} A_{n_k} \rightarrow Q^* A = R$, which is upper triangular as a limit of upper triangular matrices.

Stability of QR decomposition via Cholesky decomposition

  • So, the simplest way to compute QR decomposition is then
$$A^* A = R^* R,$$

and

$$Q = A R^{-1}.$$
  • It is a bad idea for numerical stability. Let us run a small demo (on an ill-conditioned Hilbert-type matrix).
In [11]:
import numpy as np
n = 20
r = 9
# Hilbert-type matrix: badly conditioned, a good stress test
a = [[1.0 / (i + j + 0.5) for i in range(r)] for j in range(n)]
a = np.array(a)
q, Rmat = np.linalg.qr(a)
e = np.eye(r)
print('Built-in QR orth', np.linalg.norm(np.dot(q.T, q) - e))
# Now compute Q via the Cholesky factorization of the Gram matrix
gram_matrix = a.T.dot(a)
Rmat1 = np.linalg.cholesky(gram_matrix)    # lower triangular L, so R = L^T
q1 = np.dot(a, np.linalg.inv(Rmat1.T))     # Q = A R^{-1}
print('Via Cholesky:', np.linalg.norm(np.dot(q1.T, q1) - e))
Built-in QR orth 1.1414333935044995e-15
Via Cholesky: 0.9960945952797225

Second way: Gram-Schmidt orthogonalization

  • QR decomposition is a mathematical way of writing down the Gram-Schmidt orthogonalization process.
  • Given a sequence of vectors $a_1, \ldots, a_m$ we want to find an orthonormal basis $q_1, \ldots, q_m$ such that every $a_i$ is a linear combination of $q_1, \ldots, q_i$.

Gram-Schmidt:

  1. $q_1 := a_1/\Vert a_1 \Vert$
  2. $q_2 := a_2 - (a_2, q_1) q_1, \quad q_2 := q_2/\Vert q_2 \Vert$
  3. $q_3 := a_3 - (a_3, q_1) q_1 - (a_3, q_2) q_2, \quad q_3 := q_3/\Vert q_3 \Vert$
  4. And so on

Note that the transformation from $Q$ to $A$ has triangular structure, since to form the $k$-th column of $A$ we use only $q_1, \ldots, q_k$; hence $A = QR$ with an upper triangular $R$ (recall that products and inverses of triangular matrices are again triangular).
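
A minimal sketch of the classical Gram-Schmidt steps above, assuming the columns of $A$ are linearly independent; the test matrix is random and chosen only for illustration.

In [ ]:
import numpy as np

def gram_schmidt(A):
    # Classical Gram-Schmidt applied to the columns of A (assumed full column rank)
    n, m = A.shape
    Q = np.zeros((n, m))
    R = np.zeros((m, m))
    for k in range(m):
        R[:k, k] = Q[:, :k].T @ A[:, k]      # coefficients (a_k, q_j), j < k
        q = A[:, k] - Q[:, :k] @ R[:k, k]    # subtract projections on the previous q_j
        R[k, k] = np.linalg.norm(q)
        Q[:, k] = q / R[k, k]
    return Q, R

A = np.random.randn(6, 4)
Q, R = gram_schmidt(A)
print(np.linalg.norm(Q.T @ Q - np.eye(4)), np.linalg.norm(Q @ R - A))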

Modified Gram-Schmidt

  • Gram-Schmidt can be very unstable (i.e., the produced vectors will not be orthogonal, especially if $q_k$ has a small norm before normalization).
    This is called loss of orthogonality.

  • There is a remedy, called modified Gram-Schmidt method. Instead of doing

$$q_k := a_k - (a_k, q_1) q_1 - \ldots - (a_k, q_{k-1}) q_{k-1}$$

we do it step-by-step. First we set $q_k := a_k$ and orthogonalize sequentially:

$$ q_k := q_k - (q_k, q_1)q_1, \quad q_k := q_{k} - (q_k,q_2)q_2, \ldots $$
  • In exact arithmetic the two variants are identical. In floating point arithmetic they behave completely differently!

  • Note that the complexity is $\mathcal{O}(nm^2)$ operations
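
  • A sketch of modified Gram-Schmidt on a Hilbert-type matrix similar to the one above; the orthogonalization is done against each $q_j$ sequentially, using the already updated vector:

In [ ]:
import numpy as np

def modified_gram_schmidt(A):
    n, m = A.shape
    Q = np.zeros((n, m))
    R = np.zeros((m, m))
    for k in range(m):
        q = A[:, k].copy()
        for j in range(k):
            R[j, k] = Q[:, j] @ q      # use the current, already orthogonalized q
            q = q - R[j, k] * Q[:, j]
        R[k, k] = np.linalg.norm(q)
        Q[:, k] = q / R[k, k]
    return Q, R

A = np.array([[1.0 / (i + j + 0.5) for i in range(6)] for j in range(20)])
Q, R = modified_gram_schmidt(A)
print(np.linalg.norm(Q.T @ Q - np.eye(6)))   # loss of orthogonality is much milder than for classical GS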

QR decomposition: the (almost) practical way

  • If $A = QR$, then
$$ R = Q^* A, $$

and we need to find a certain orthogonal matrix $Q$ that brings a matrix into upper triangular form.

  • For simplicity, we will look for an $n \times n$ unitary matrix $Q$ such that
$$ Q^* A = \begin{bmatrix} R \\ 0_{(n-m) \times m} \end{bmatrix}, \qquad R = \begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & 0 & * \end{bmatrix}. $$
  • We will do it column-by-column.
  • First, we find a Householder matrix $H_1 = I - 2 uu^{*}$, $\Vert u \Vert_2 = 1$, such that (we illustrate it on a $4 \times 3$ matrix)
$$ H_1 A = \begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & * & * \\ 0 & * & * \end{bmatrix} $$

Then,

$$ H_2 H_1 A = \begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & 0 & * \\ 0 & 0 & * \end{bmatrix}, $$

where

$$ H_2 = \begin{bmatrix} 1 & 0 \\ 0 & H'_2 \end{bmatrix}, $$

and $H'_2$ is a $3 \times 3$ Householder matrix.

Finally,

$$ H_3 H_2 H_1 A = \begin{bmatrix} * & * & * \\ 0 & * & * \\ 0 & 0 & * \\ 0 & 0 & 0 \end{bmatrix}, $$

You can try to implement it yourself; a minimal sketch is given below.
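
A minimal sketch of this column-by-column Householder reduction (no blocking, no handling of the edge case of an all-zero column); the test matrix is random and chosen only for illustration.

In [ ]:
import numpy as np

def householder_qr(A):
    # Reduce A (n x m, n >= m) to upper triangular form by Householder reflections
    A = np.array(A, dtype=float)
    n, m = A.shape
    Q = np.eye(n)
    for k in range(m):
        x = A[k:, k]
        u = x.copy()
        u[0] += (1.0 if x[0] >= 0 else -1.0) * np.linalg.norm(x)  # reflect x onto -+ ||x|| e_1
        u /= np.linalg.norm(u)
        A[k:, k:] -= 2.0 * np.outer(u, u @ A[k:, k:])   # apply H_k = I - 2 u u^T to the trailing block
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ u, u)     # accumulate Q = H_1 H_2 ... H_m
    return Q, np.triu(A)

A = np.random.randn(4, 3)
Q, R = householder_qr(A)
print(np.linalg.norm(Q @ R - A), np.linalg.norm(Q.T @ Q - np.eye(4)))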

QR decomposition: real life

  • In reality, since this is a dense matrix factorization, you should implement the algorithm in terms of blocks (why?).

  • Instead of using a single Householder transformation, we use a block Householder transformation of the form

$$H = I - 2UU^*, $$

where $U^* U = I$.

  • This allows us to use BLAS-3 operations.
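
  • A quick numerical sanity check (just a sketch with a random orthonormal $U$) that $H = I - 2UU^*$ is indeed unitary whenever $U^* U = I$:

In [ ]:
import numpy as np

n, b = 8, 3
U, _ = np.linalg.qr(np.random.randn(n, b))   # U has orthonormal columns, so U^T U = I
H = np.eye(n) - 2.0 * U @ U.T                # block Householder transformation
print(np.linalg.norm(H.T @ H - np.eye(n)))   # should be of the order of machine precision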

Rank-revealing QR-decomposition

The rank-revealing QR decomposition has the form

$$A P = Q R,$$

where $P$ is a permutation matrix (it permutes the columns of $A$), and $R$ has the block form

$$R = \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22}\end{bmatrix}.$$
  • The goal is to find $P$ such that the norm of $R_{22}$ is small, so you can determine the numerical rank by looking at it.

  • An estimate is $\sigma_{r+1} \leq \Vert R_{22} \Vert_2$, where $r$ is the size of the block $R_{11}$ (check why).
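
  • A short sketch using SciPy's column-pivoted QR (scipy.linalg.qr with pivoting=True) on a synthetic matrix of numerical rank 5; the matrix and noise level are assumptions made only for illustration:

In [ ]:
import numpy as np
from scipy.linalg import qr

np.random.seed(0)
# 50 x 20 matrix of rank 5 plus a tiny perturbation
A = np.random.randn(50, 5) @ np.random.randn(5, 20) + 1e-10 * np.random.randn(50, 20)

Q, R, p = qr(A, mode='economic', pivoting=True)   # column-pivoted QR: A[:, p] = Q R
print(np.abs(np.diag(R)))                         # the numerical rank shows up as a sharp drop in |R_kk|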

Summary

  • LU and QR decompositions can be computed using direct methods in a finite number of operations.

  • What about Schur form and SVD?

  • They cannot be computed by direct methods (why?); they can only be computed by iterative methods.

  • Nevertheless, these iterative methods still have the same $\mathcal{O}(n^3)$ complexity in floating point arithmetic, thanks to fast convergence.

Schur form

  • Recall that every matrix can be written in the Schur form
$$A = Q T Q^*,$$

with upper triangular $T$ and unitary $Q$; this decomposition gives the eigenvalues of the matrix (they are on the diagonal of $T$).

  • The first and the main algorithm for computing the Schur form is the QR algorithm.
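
  • Before deriving the algorithm, a quick sanity check of the Schur form itself, assuming SciPy is available (scipy.linalg.schur relies on LAPACK routines that implement a shifted variant of the QR algorithm):

In [ ]:
import numpy as np
from scipy.linalg import schur

np.random.seed(0)
A = np.random.randn(5, 5)
T, Q = schur(A, output='complex')                 # A = Q T Q^*, T upper triangular
print(np.linalg.norm(Q @ T @ Q.conj().T - A))     # reconstruction error
print(np.sort_complex(np.diag(T)))                # eigenvalues sit on the diagonal of T
print(np.sort_complex(np.linalg.eigvals(A)))      # agree with the eigenvalues of A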

QR algorithm

  • The QR algorithm was independently proposed in 1961 by Kublanovskaya and Francis.

  • Do not confuse the QR algorithm with the QR decomposition!

  • The QR decomposition is a representation of a matrix, whereas the QR algorithm uses the QR decomposition to compute the eigenvalues!

Way to QR algorithm

  • Consider the equation
$$A = Q T Q^*,$$

and rewrite it in the form

$$ Q T = A Q. $$
  • On the left-hand side we see a QR factorization of the matrix $AQ$.

  • We can use this to derive a fixed-point iteration for the Schur form, also known as the QR algorithm.

Derivation of the QR algorithm as fixed-point iteration

We can write down the iterative process

$$ Q_{k+1} R_{k+1} = A Q_k, \quad Q_{k+1}^* A = R_{k+1} Q^*_k $$

Introduce

$$A_k = Q^*_k A Q_k = Q^*_k Q_{k+1} R_{k+1} = \widehat{Q}_k R_{k+1}, \qquad \widehat{Q}_k = Q^*_k Q_{k+1},$$

and the new approximation reads

$$A_{k+1} = Q^*_{k+1} A Q_{k+1} = (Q_{k+1}^* A) Q_{k+1} = R_{k+1} Q^*_k Q_{k+1} = R_{k+1} \widehat{Q}_k.$$

So we arrive at the standard form of the QR algorithm.

The final formulas are then written in the classical QRRQ-form:

  1. Start from $A_0 = A$.
  2. Compute QR factorization of $A_k = Q_k R_k$.
  3. Set $A_{k+1} = R_k Q_k$.

Iterate until $A_k$ is triangular enough (e.g. norm of subdiagonal part is small enough).

What about the convergence and complexity

Statement

Matrices $A_k$ are unitarily similar to $A$:

$$A_k = Q^*_{k-1} A_{k-1} Q_{k-1} = (Q_0 Q_1 \ldots Q_{k-1})^* A \, (Q_0 Q_1 \ldots Q_{k-1}),$$

and the product of unitary matrices is a unitary matrix.

  • Complexity of each step is $\mathcal{O}(n^3)$, if a general QR decomposition is done.

  • Our hope is that $A_k$ will be very close to a triangular matrix for sufficiently large $k$.

In [12]:
import numpy as np
n = 4
a = np.array([[1.0/(i + j + 0.5) for i in range(n)] for j in range(n)])
niters = 200
for k in range(niters):
    q, rmat = np.linalg.qr(a)   # A_k = Q_k R_k
    a = rmat.dot(q)             # A_{k+1} = R_k Q_k
print('Leading 3x3 block of a:')
print(a[:3, :3])
Leading 3x3 block of a:
[[ 2.41052440e+000 -5.41127562e-017 -4.10963400e-017]
 [ 2.42500623e-168  3.49984625e-001  5.13648922e-017]
 [ 0.00000000e+000  6.56745067e-273  1.53236733e-002]]

Convergence and complexity of the QR algorithm

  • The convergence of the QR algorithm is from the largest eigenvalues to the smallest.

  • At least 2-3 iterations are needed per eigenvalue.

  • Each step requires one QR factorization and one matrix-by-matrix product, resulting in $\mathcal{O}(n^3)$ complexity per iteration.

Q: does it mean $\mathcal{O}(n^4)$ complexity totally?

A: fortunately, not.

  • We can speed up the QR algorithm by using shifts, since $A_k - \lambda I$ has the same Schur vectors.

  • We will discuss these details later.

A few words about the SVD

  • Last but not least: the singular value decomposition of matrices.
$$A = U \Sigma V^*.$$
  • We can compute it via the eigendecomposition of
$$A^* A = V \Sigma^2 V^*,$$

and/or

$$A A^* = U \Sigma^2 U^*,$$

with the QR algorithm, but it is a bad idea (cf. the Gram-matrix demo above); a small demo is given after this list.

  • We will discuss methods for computing SVD later.
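
  • A small demo of why squaring is harmful (a sketch on a Hilbert-type matrix): singular values below roughly $\sqrt{\varepsilon_{\mathrm{mach}}}\,\Vert A \Vert$ are lost when computed from $A^* A$.

In [ ]:
import numpy as np

A = np.array([[1.0 / (i + j + 0.5) for i in range(10)] for j in range(20)])

s_svd = np.linalg.svd(A, compute_uv=False)            # reference singular values
eigs = np.linalg.eigvalsh(A.T @ A)                    # eigenvalues of the Gram matrix (ascending)
s_gram = np.sqrt(np.clip(eigs, 0.0, None))[::-1]      # some of them may even come out negative

print(s_svd[-3:])    # smallest singular values, computed directly
print(s_gram[-3:])   # the same values via A^T A: accuracy below ~sqrt(eps)*||A|| is lost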

Summary

  • QR decomposition and Gram-Schmidt algorithm, reduction to a simpler form by Householder transformations
  • Schur decomposition and QR algorithm

Next steps

  • Efficient implementation of QR algorithm, convergence properties
  • Efficient computation of the SVD: 4 algorithms
  • More applications of the SVD

Questions?