Lecture 5: Linear systems

Previous lecture

Today's lecture

Linear systems

Linear equations and matrices

\begin{align*} &2 x + 3 y = 5\quad &\longrightarrow \quad &2x + 3 y + 0 z = 5\\ &2 x + 3z = 5\quad &\longrightarrow\quad &2 x + 0 y + 3 z = 5\\ &x + y = 2\quad &\longrightarrow\quad & 1 x + 1 y + 0 z = 2\\ \end{align*}

Matrix form

$$ \begin{pmatrix} 2 & 3 & 0 \\ 2 & 0 & 3 \\ 1 & 1 & 0 \\ \end{pmatrix}\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 5 \\ 5 \\ 2 \end{pmatrix} $$

or simply

$$ A u = f, $$

where $A$ is a $3 \times 3$ matrix and $f$ is the right-hand side.

Over/under determined linear systems

If the system $Au = f$ has more equations than unknowns, it is called overdetermined (typically, no exact solution exists).

If it has fewer equations than unknowns, it is called underdetermined (if a solution exists, it is not unique).

Existence of solutions

A solution to the linear system of equations with a square matrix $A$

$$A u = f$$

exists for every right-hand side $f$ iff $\det A \ne 0$ (the matrix is nonsingular),

or, equivalently, iff $\mathrm{rank}\, A = n$ (the matrix has full rank).

Scales of linear systems

In different applications, the typical sizes of linear systems can be very different.

Linear systems can be big

The main difficulty is that these systems are big: millions or billions of unknowns!

Linear systems can be structured

Q: how to work with such matrices?

A: fortunately, those matrices are structured and require $\mathcal{O}(N)$ parameters to be stored.

$$ \begin{pmatrix} 2 & -1 & 0 & 0 & 0 \\ -1 & 2 & -1 & 0 & 0 \\ 0 & -1 & 2 & -1 & 0 \\ 0 & 0 &-1& 2 & -1 \\ 0 & 0 & 0 & -1 & 2 \\ \end{pmatrix} $$
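As a sketch (the matrix above, extended to size $N$, serves as the example): a tridiagonal matrix is fully described by its three diagonals, so only $\mathcal{O}(N)$ numbers are stored, and the system can be solved in $\mathcal{O}(N)$ operations with a banded solver.

```python
import numpy as np
from scipy.linalg import solve_banded
from scipy.sparse import diags

N = 1000
# Only O(N) parameters: the three diagonals
main = 2.0 * np.ones(N)
off = -1.0 * np.ones(N - 1)

# Sparse representation, useful for matrix-vector products
A = diags([off, main, off], offsets=[-1, 0, 1], format="csr")

f = np.ones(N)

# Banded storage expected by solve_banded: each row is a diagonal
ab = np.zeros((3, N))
ab[0, 1:] = off    # superdiagonal
ab[1, :] = main    # main diagonal
ab[2, :-1] = off   # subdiagonal

u = solve_banded((1, 1), ab, f)    # O(N) solve
print(np.linalg.norm(A @ u - f))   # residual check (machine-precision level)
```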

Main questions about linear systems

  1. What accuracy can we expect in the computed solution (due to rounding errors)?
  2. How do we compute the solution? (LU decomposition, Gaussian elimination)
  3. What is the complexity of solving a linear system?

How to solve linear systems?

Important: forget about determinants and Cramer's rule (it is still fine for $2 \times 2$ matrices, though)!

How to solve linear systems?

The main tool is variable elimination.

\begin{align*} &2 y + 3 x = 5 \quad&\longrightarrow \quad &y = 5/2 - 3/2 x \\ &2 x + 3z = 5 \quad&\longrightarrow\quad &z = 5/3 - 2/3 x\\ &z + y = 2 \quad&\longrightarrow\quad & 5/2 + 5/3 - (3/2 + 2/3) x = 2,\\ \end{align*}

and from the last equation you find $x$ (and then all the previously eliminated unknowns).

This process is called Gaussian elimination and is one of the most widely used algorithms.

Gaussian elimination

Gaussian elimination consists of two steps:

  1. Forward step
  2. Backward step

Forward step

In the forward step, we express $x_1$ from the first equation,

$$ x_1 = \frac{f_1 - (a_{12} x_2 + \ldots + a_{1n} x_n)}{a_{11}}, $$

and then substitute it into equations $2, \ldots, n$. Repeating this for $x_2, \ldots, x_{n-1}$ leaves an upper triangular system.

Backward step

In the backward step, we solve the resulting upper triangular system from the bottom up: the last equation gives $x_n$, substituting it into equation $n-1$ gives $x_{n-1}$, and so on up to $x_1$.
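A minimal sketch of the backward step, i.e. back substitution for an upper triangular system $U x = y$ (the names `U`, `y` and the test matrix are illustrative):

```python
import numpy as np

def back_substitution(U, y):
    """Solve U x = y for an upper triangular, nonsingular U."""
    n = U.shape[0]
    x = np.zeros(n)
    for i in reversed(range(n)):
        # everything to the right of the diagonal is already known
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

# quick check on a random upper triangular system
rng = np.random.default_rng(0)
U = np.triu(rng.standard_normal((5, 5))) + 5 * np.eye(5)
y = rng.standard_normal(5)
print(np.allclose(U @ back_substitution(U, y), y))  # True
```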

Gaussian elimination and LU decomposition

Definition: LU-decomposition of the square matrix $A$ is the representation

$$A = LU,$$

where $L$ is a lower triangular matrix and $U$ is an upper triangular matrix.

This factorization is non-unique, so it is typical to require that the matrix $L$ has ones on the diagonal.

The main goal of the LU decomposition is to solve linear systems, because

$$ A^{-1} f = (L U)^{-1} f = U^{-1} L^{-1} f, $$

and this reduces to the solution of two triangular systems: the forward step

$$ L y = f, $$

and the backward step

$$ U x = y. $$
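In code this could look as follows (a sketch using SciPy's LU routines; `A` and `f` are random examples):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
f = rng.standard_normal(n)

lu, piv = lu_factor(A)       # one O(n^3) factorization (with pivoting)
x = lu_solve((lu, piv), f)   # each subsequent solve costs only O(n^2)

print(np.allclose(A @ x, f))  # True
```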

Does $LU$ decomposition always exist?

Complexity of the Gaussian elimination/LU decomposition

The complexity of Gaussian elimination / LU decomposition of a dense $n \times n$ matrix is $\mathcal{O}(n^3)$ arithmetic operations.

Think a little bit: can the Strassen algorithm help here?

Block LU-decomposition

We can try to compute a block version of the LU-decomposition:

$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix} $$

Existence of the LU-decomposition

Q: when does the LU-decomposition exist, i.e., for which class of matrices?

A: it exists for strictly regular matrices, i.e., matrices whose leading principal submatrices are all nonsingular.

Strictly regular matrices and LU decomposition

Corollary: If $L$ is unit triangular (ones on the diagonal), then the $LU$-decomposition is unique.

Proof: Indeed, $L_1 U_1 = L_2 U_2$ means $L_2^{-1} L_1 = U_2 U_1^{-1}$. $L_2^{-1} L_1 $ is lower triangular with ones on the diagonal. $U_2 U_1^{-1}$ is upper triangular. Thus, $L_2^{-1} L_1 = U_2 U_1^{-1} = I$ and $L_1 = L_2$, $U_1 = U_2$.

LU for positive definite Hermitian matrices (Cholesky factorization)

Definition. A matrix $A$ is called positive definite if for any $x \ne 0$ we have

$$ (x, Ax) > 0. $$

A Hermitian positive definite matrix admits the Cholesky factorization

$$A = RR^*,$$

where $R$ is a lower triangular matrix.
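A quick sketch with NumPy's Cholesky routine, which returns exactly such a lower triangular factor:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)   # symmetric (Hermitian) positive definite example

R = np.linalg.cholesky(A)     # lower triangular factor
print(np.allclose(R @ R.T, A))        # True: A = R R^*
print(np.allclose(R, np.tril(R)))     # True: R is lower triangular
```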

Computing LU-decomposition

Check:
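A minimal sketch of such a computation (Doolittle-style elimination without pivoting, assuming all pivots stay nonzero), checked on the $3 \times 3$ matrix from the beginning of the lecture:

```python
import numpy as np

def lu_decompose(A):
    """Naive LU without pivoting: unit lower triangular L, upper triangular U."""
    n = A.shape[0]
    U = A.astype(float)
    L = np.eye(n)
    for k in range(n):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]        # elimination multiplier
            U[i, k:] -= L[i, k] * U[k, k:]      # eliminate entry (i, k)
    return L, U

A = np.array([[2., 3., 0.],
              [2., 0., 3.],
              [1., 1., 0.]])
L, U = lu_decompose(A)
print(np.allclose(L @ U, A))  # True
```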

When LU fails

$$ A = \begin{pmatrix} \varepsilon & 1 \\ 1 & 1 \end{pmatrix} $$

If $\varepsilon = 0$, the first pivot is zero and the elimination breaks down; for small $\varepsilon$, the factors contain entries of order $1/\varepsilon$ and the computation becomes numerically unstable. Let us do a small demo here.
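A possible version of the demo (a sketch): for a tiny $\varepsilon$ the factorization without pivoting produces entries of order $1/\varepsilon$ and the product $LU$ no longer reproduces $A$, while SciPy's pivoted LU does.

```python
import numpy as np
from scipy.linalg import lu

eps = 1e-18
A = np.array([[eps, 1.],
              [1.,  1.]])

# LU without pivoting: L[1, 0] = 1/eps is huge and accuracy is lost
L = np.array([[1., 0.], [1. / eps, 1.]])
U = np.array([[eps, 1.], [0., 1. - 1. / eps]])
print(np.linalg.norm(L @ U - A))      # ~1: the (2, 2) entry of A is lost

# Pivoted LU (A = P L U) reproduces A to machine precision
P, L, U = lu(A)
print(np.linalg.norm(P @ L @ U - A))  # ~0
```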

The concept of pivoting

To avoid small pivots, rows are permuted at each elimination step so that the pivot has the largest magnitude in its column (row pivoting, also called partial pivoting). This yields the decomposition

$$A = P L U,$$

where $P$ is a permutation matrix.

Q. What makes row pivoting good?

A. With row pivoting, all multipliers are bounded,

$$ | L_{ij}| \leq 1, $$

but the elements of $U$ can still grow, by a factor of up to $2^{n-1}$ (in practice, such growth is very rarely encountered).

Stability of linear systems

Let us illustrate this issue with the following example.
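Since the original demo is not reproduced here, here is a small sketch of the kind of effect meant (the almost singular $2 \times 2$ matrix is chosen purely for illustration): a $10^{-6}$ perturbation of the right-hand side changes the solution by $\mathcal{O}(1)$.

```python
import numpy as np

a = np.array([[1., 1.],
              [1., 1. + 1e-6]])      # almost singular => ill-conditioned
f = np.array([2., 2. + 1e-6])        # exact solution is x = (1, 1)

x = np.linalg.solve(a, f)
x_pert = np.linalg.solve(a, f + np.array([0., 1e-6]))  # tiny perturbation of f

print(np.linalg.cond(a))              # ~4e6
print(np.linalg.norm(x_pert - x))     # O(1) change in the solution
```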

Questions from the demo

But before that we have to define the inverse.

Inverse: definition

A matrix $X$ is called the inverse of a square matrix $A$ (written $X = A^{-1}$) if

$$ AX = XA = I, $$

where $I$ is the identity matrix (i.e., $I_{ij} = 0$ if $i \ne j$ and $1$ otherwise).

The inverse can be computed column by column: the $i$-th column $x_i$ of $A^{-1}$ solves

$$ A x_i = e_i,$$

where $e_i$ is the $i$-th column of the identity matrix.
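As a small sketch (illustrative only), the inverse can thus be assembled from $n$ linear solves that all reuse a single factorization of $A$:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))

lu, piv = lu_factor(A)                                            # factorize once
X = np.column_stack([lu_solve((lu, piv), e) for e in np.eye(n)])  # columns x_i

print(np.allclose(X, np.linalg.inv(A)))   # True
```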

Inverse matrix and linear systems

If we have computed $A^{-1}$, the solution of the linear system

$$Ax = f$$

is just $x = A^{-1} f$.

Indeed,

$$ A(A^{-1} f) = (AA^{-1})f = I f = f. $$

Neumann series

Neumann series:

If a matrix $F$ is such that $\Vert F \Vert < 1$ holds, then the matrix $(I - F)$ is invertible and

$$(I - F)^{-1} = I + F + F^2 + F^3 + \ldots = \sum_{k=0}^{\infty} F^k.$$

Note that it is a matrix version of the geometric progression.

Q: what norm is considered here? What is the "best possible" norm here?

Proof

The proof is constructive. First, we show that the series $\sum_{k=0}^{\infty} F^k$ converges: it converges absolutely, since $\sum_{k=0}^{\infty} \Vert F^k \Vert \leq \sum_{k=0}^{\infty} \Vert F \Vert^k < \infty$ whenever $\Vert F \Vert < 1$.

Like in the scalar case, we have

$$ (I - F) \sum_{k=0}^N F^k = (I - F^{N+1}) \rightarrow I, \quad N \to +\infty $$

Indeed,

$$ \| (I - F^{N+1}) - I\| = \|F^{N+1}\| \leqslant \|F\|^{N+1} \to 0, \quad N\to +\infty. $$

We can also estimate the norm of the inverse:

$$ \left\Vert \sum_{k=0}^N F^k \right\Vert \leq \sum_{k=0}^N \Vert F \Vert^k \Vert I \Vert \leq \frac{\Vert I \Vert}{1 - \Vert F \Vert} $$
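A quick numerical sanity check of the statement and of the norm bound (a sketch; `F` is just a random matrix rescaled so that $\Vert F \Vert_2 = 0.5 < 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
F = rng.standard_normal((n, n))
F *= 0.5 / np.linalg.norm(F, 2)           # now ||F||_2 = 0.5 < 1

# partial sums of the Neumann series
S, term = np.eye(n), np.eye(n)
for _ in range(60):
    term = term @ F
    S += term

print(np.linalg.norm(S - np.linalg.inv(np.eye(n) - F)))   # tiny remainder
print(np.linalg.norm(S, 2) <= 1.0 / (1.0 - 0.5))          # True: norm bound holds
```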

Small perturbation of the inverse

Using the Neumann series with $F = -A^{-1}E$, for a perturbation $E$ such that $\Vert A^{-1} E \Vert < 1$ we get

$$(A + E)^{-1} = \left(I + A^{-1}E\right)^{-1} A^{-1} = \sum_{k=0}^{\infty} (-A^{-1} E)^k A^{-1}$$

and moreover,

$$ \frac{\Vert (A + E)^{-1} - A^{-1} \Vert}{\Vert A^{-1} \Vert} \leq \frac{\Vert A^{-1} \Vert \Vert E \Vert \Vert I \Vert}{1 - \Vert A^{-1} E \Vert}. $$

As you see, the norm of the inverse enters the estimate.

Condition number of a linear system

Now consider the perturbed linear system:

$$ (A + \Delta A) \widehat{x} = f + \Delta f. $$

Estimates

$$ \begin{split} \widehat{x} - x &= (A + \Delta A)^{-1} (f + \Delta f) - A^{-1} f =\\ &= \left((A + \Delta A)^{-1} - A^{-1}\right)f + (A + \Delta A)^{-1} \Delta f = \\ & = \Big[ \sum_{k=0}^{\infty} (-A^{-1} \Delta A)^k - I \Big]A^{-1} f + \Big[\sum_{k=0}^{\infty} (-A^{-1} \Delta A)^k \Big]A^{-1} \Delta f \\ &= \Big[\sum_{k=1}^{\infty} (-A^{-1} \Delta A)^k\Big] A^{-1} f + \Big[\sum_{k=0}^{\infty} (-A^{-1} \Delta A)^k \Big] A^{-1} \Delta f, \end{split} $$

therefore

$$ \begin{split} \frac{\Vert \widehat{x} - x \Vert}{\Vert x \Vert} \leq &\frac{1}{\|A^{-1}f\|} \Big[ \frac{\|A^{-1}\|\|\Delta A\|}{1 - \|A^{-1}\Delta A\|}\|A^{-1}f\| + \frac{1}{1 - \|A^{-1} \Delta A\|} \|A^{-1} \Delta f\| \Big] \\ \leq & \frac{\|A\|\|A^{-1}\|}{1 - \|A^{-1}\Delta A\|} \frac{\|\Delta A\|}{\|A\|} + \frac{\|A^{-1}\|}{1 - \|A^{-1}\Delta A\|} \frac{\|\Delta f\|}{\|A^{-1}f\|}\\ \end{split} $$

Note that $\|AA^{-1}f\| \leq \|A\|\|A^{-1}f\|$, therefore $\| A^{-1} f \| \geq \frac{\|f\|}{\|A\|}$

Now we are ready to get the final estimate

$$ \begin{split} \frac{\Vert \widehat{x} - x \Vert}{\Vert x \Vert} \leq &\frac{\Vert A \Vert \Vert A^{-1} \Vert}{1 - \|A^{-1}\Delta A\|} \Big(\frac{\Vert\Delta A\Vert}{\Vert A \Vert} + \frac{\Vert \Delta f \Vert}{ \Vert f \Vert}\Big) \leq \\ \leq &\frac{\Vert A \Vert \Vert A^{-1} \Vert}{1 - \|A\|\|A^{-1}\|\frac{\|\Delta A\|}{\|A\|}} \Big(\frac{\Vert\Delta A\Vert}{\Vert A \Vert} + \frac{\Vert \Delta f \Vert}{ \Vert f \Vert}\Big) \equiv \\ \equiv &\frac{\mathrm{cond}(A)}{1 - \mathrm{cond}(A)\frac{\|\Delta A\|}{\|A\|}} \Big(\frac{\Vert\Delta A\Vert}{\Vert A \Vert} + \frac{\Vert \Delta f \Vert}{ \Vert f \Vert}\Big) \end{split} $$

The crucial role is played by the condition number $\mathrm{cond}(A) = \Vert A \Vert \Vert A^{-1} \Vert$.

Condition number

In particular, if only the right-hand side is perturbed ($\Delta A = 0$), the estimate simplifies to

$$ \frac{\Vert \widehat{x} - x \Vert}{\Vert x \Vert} \leq \mathrm{cond}(A) \frac{\|\Delta f\|}{\|f\|}. $$

In the spectral norm, the condition number is expressed through the singular values:

$$ \mathrm{cond}_2 (A) = \|A\|_2 \|A^{-1}\|_2 = \frac{\sigma_{\max}}{\sigma_{\min}} $$
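A small sketch of computing the condition number numerically (the matrix is just a random example):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

sigma = np.linalg.svd(A, compute_uv=False)
print(sigma[0] / sigma[-1])     # sigma_max / sigma_min
print(np.linalg.cond(A, 2))     # the same number
print(np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2))  # the definition
```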

Hilbert matrix (again)
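A sketch of the kind of experiment meant here (the original demo is not reproduced): take the Hilbert matrix $a_{ij} = 1/(i + j - 1)$, generate the right-hand side from a known exact solution, solve, and look at the error.

```python
import numpy as np
from scipy.linalg import hilbert

n = 12
A = hilbert(n)                         # extremely ill-conditioned
x_exact = np.ones(n)
f = A @ x_exact                        # right-hand side from a known solution

x = np.linalg.solve(A, f)
print(np.linalg.cond(A))                                      # huge, of order 1e16
print(np.linalg.norm(x - x_exact) / np.linalg.norm(x_exact))  # far above machine precision
```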

And with a random right-hand side...

Can you think about an explanation?

Overdetermined linear systems

If the system has more equations than unknowns, we usually cannot satisfy all of them exactly, so we look for the least-squares solution:

$$\Vert A x - b \Vert_2 \rightarrow \min_x$$

Overdetermined system and Gram matrix

The optimality condition is $\nabla \left(\|Ax-b\|_2^2\right) = 0$, where $\nabla$ denotes the gradient. Therefore,

$$ \nabla \left(\|Ax-b\|_2^2\right) = 2(A^*A x - A^*b) = 0. $$

Thus,

$$ A^* A x = A^* b, $$

which is called the system of normal equations; the matrix $A^* A$ is the Gram matrix.
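A sketch comparing the normal-equations solution with NumPy's built-in least-squares solver (a random tall matrix serves as the example; for real matrices `A.T` plays the role of $A^*$):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5
A = rng.standard_normal((m, n))       # tall matrix: more equations than unknowns
b = rng.standard_normal(m)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)     # Gram matrix / normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # built-in least squares

print(np.allclose(x_normal, x_lstsq))  # True (A is well-conditioned here)
```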

Pseudoinverse

The least squares solution can be written as

$$x = A^{\dagger} b,$$

where $A^{\dagger}$ is the pseudoinverse of $A$, defined as

$$A^{\dagger} = \lim_{\alpha \rightarrow 0}(\alpha I + A^* A)^{-1} A^*.$$

If $A$ has full column rank, this reduces to

$$A^{\dagger} = (A^* A)^{-1} A^*,$$

and if $A$ is square and nonsingular, we recover the usual inverse:

$$A^{\dagger} = (A^* A)^{-1} A^* = A^{-1} A^{-*} A^* = A^{-1}.$$

Compute pseudoinverse via SVD

Let $A = U \Sigma V^*$ be the SVD of $A$. Then,

$$A^{\dagger} = V \Sigma^{\dagger} U^*,$$

where $\Sigma^{\dagger}$ consists of inverses of non-zero singular values of $A$. Indeed,

\begin{align*} A^{\dagger} &= \lim_{\alpha \rightarrow 0}(\alpha I + A^* A)^{-1} A^* = \lim_{\alpha \rightarrow 0}( \alpha VV^* + V \Sigma^2 V^*)^{-1} V \Sigma U^* \\ & = \lim_{\alpha \rightarrow 0}( V(\alpha I + \Sigma^2) V^*)^{-1} V \Sigma U^* = V \lim_{\alpha \rightarrow 0}(\alpha I + \Sigma^2)^{-1} \Sigma U^* = V \Sigma^{\dagger} U^*. \end{align*}
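A sketch of this formula in code, checked against NumPy's built-in pseudoinverse (real matrix, so transposes stand in for conjugate transposes):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((7, 3))            # full column rank (generically)

U, s, Vh = np.linalg.svd(A, full_matrices=False)   # thin SVD
A_pinv = Vh.T @ np.diag(1.0 / s) @ U.T             # V Sigma^+ U^*

print(np.allclose(A_pinv, np.linalg.pinv(A)))      # True
```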

A canonical way to solve linear least squares

Is to use the $QR$ decomposition.

$$ A = Q R, $$

where $Q$ is unitary, and $R$ is upper triangular (details in the next lectures).

$$ x = A^{\dagger}b = (A^*A)^{-1}A^*b = ((QR)^*(QR))^{-1}(QR)^*b = (R^*Q^*QR)^{-1}R^*Q^*b = R^{-1}Q^*b, $$

so it remains to solve the triangular system

$$ Rx = Q^* b. $$
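A sketch of the QR-based least-squares solve (thin QR from NumPy, triangular back substitution from SciPy):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
m, n = 100, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

Q, R = np.linalg.qr(A)                     # thin QR: Q is m x n, R is n x n
x = solve_triangular(R, Q.T @ b)           # solve R x = Q^* b

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_ref))               # True
```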

Padding into a bigger system

The normal equations can also be written in terms of the residual:

$$A^* r = 0, \quad r = Ax - b,$$

or in the block form

$$ \begin{pmatrix} 0 & A^* \\ A & -I \end{pmatrix} \begin{pmatrix} x \\ r \end{pmatrix} = \begin{pmatrix} 0 \\ b \end{pmatrix}, $$

The total system is square of size $(n + m) \times (n + m)$, and its condition number is the same as for $A$.

Example of LS

Consider a two-dimensional example. Suppose we have a linear model

$$y = ax + b$$

and noisy data $(x_1, y_1), \dots, (x_n, y_n)$. Then the linear system for the coefficients looks as follows:

$$ \begin{split} a x_1 + b &= y_1 \\ &\ \ \vdots \\ a x_n + b &= y_n \end{split} $$

or in a matrix form

$$ \begin{pmatrix} x_1 & 1 \\ \vdots & \vdots \\ x_n & 1 \\ \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \\ \end{pmatrix}, $$

which is an overdetermined system: there are $n$ equations and only two unknowns, $a$ and $b$.
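A sketch of this example in code (the noisy data are synthetic, generated only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(0, 1, n)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(n)   # noisy data around y = 2x + 1

A = np.column_stack([x, np.ones(n)])                # the overdetermined matrix above
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(coef)    # recovered (a, b), close to (2, 1)
```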

Looking for structure

Fortunately, matrices arising in real-life applications are typically not dense, but have a certain structure:

Summary

Next lecture