Lecture 4: Matrix rank, low-rank approximation, SVD

Previous lecture

Today's lecture

Matrix and linear spaces

$$ A = [a_1, \ldots, a_m], $$

where $a_i \in \mathbb{C}^{n\times 1}$, $i = 1, \ldots, m$.

$$ y = Ax \quad \Longleftrightarrow \quad y = a_1 x_1 + a_2 x_2 + \ldots +a_m x_m. $$

Linear dependence

Definition. Vectors $a_i$ are called linearly dependent if there exist coefficients $x_i$, not all equal to zero, such that

$$\sum_i a_i x_i = 0,$$

or in the matrix form

$$ Ax = 0, \quad \Vert x \Vert \ne 0. $$

In this case, we say that the matrix $A$ has a non-trivial nullspace (or kernel) denoted by $N(A)$ (or $\text{ker}(A)$).

Vectors that are not linearly dependent are called linearly independent.

Linear (vector) space

A linear space spanned by vectors $\{a_1, \ldots, a_m\}$ is defined as all possible vectors of the form

$$ \mathcal{L}(a_1, \ldots, a_m) = \left\{y: y = \sum_{i=1}^m a_i x_i, \, \forall x_i, \, i=1,\dots, m \right\}. $$

In the matrix form, the linear space is the set of all $y$ such that

$$y = A x.$$

This set is also called the range (or image) of the matrix, denoted by $\text{range}(A)$ (or $\text{im}(A)$) respectively.

Dimension of a linear space

Matrix rank

Theorem
The dimension of the column space of the matrix is equal to the dimension of its row space.

Proof

Full-rank matrix

Suppose we have a linear space spanned by $n$ vectors. Let these vectors be random, with elements drawn from the standard normal distribution $\mathcal{N}(0, 1)$.

Q: What is the probability that this subspace has dimension $m < n$?

A: Zero. A random matrix has full rank with probability 1.
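A quick numerical illustration of this fact (a minimal sketch; the matrix sizes are arbitrary):

```python
import jax
import jax.numpy as jnp

# Draw a random matrix with i.i.d. entries from N(0, 1)
n, m = 100, 60
A = jax.random.normal(jax.random.PRNGKey(0), (n, m))

# With probability 1 its columns are linearly independent,
# so the rank equals min(n, m)
print(jnp.linalg.matrix_rank(A))   # expected: 60
```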

Dimensionality reduction

Johnson–Lindenstrauss lemma

Let $N\gg 1$. Given $0 < \epsilon < 1$, a set of $m$ points in $\mathbb{R}^N$ and a number $n > \frac{8 \log m}{\epsilon^2}$ (we want $n\ll N$).

Then there exists a linear map $f: \mathbb{R}^N \rightarrow \mathbb{R}^n$ such that for every pair of points $u, v$ from the set the following inequality holds:

$$(1 - \epsilon) \Vert u - v \Vert^2 \leq \Vert f(u) - f(v) \Vert^2 \leq (1 + \epsilon) \Vert u - v \Vert^2.$$
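A scaled Gaussian random matrix provides such a map with high probability. Below is a minimal sketch of this; the point set, the sizes $N$, $n$, $m$ and the check are illustrative assumptions, not part of the lemma itself.

```python
import jax
import jax.numpy as jnp

N, n, m = 5000, 500, 30              # ambient dimension, reduced dimension, number of points
key_pts, key_map = jax.random.split(jax.random.PRNGKey(0))
points = jax.random.normal(key_pts, (m, N))

# Random Gaussian projection; the 1/sqrt(n) scaling preserves norms in expectation
Phi = jax.random.normal(key_map, (n, N)) / jnp.sqrt(n)
projected = points @ Phi.T

# Pairwise squared distances before and after the projection
d2 = jnp.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
d2_proj = jnp.sum((projected[:, None, :] - projected[None, :, :]) ** 2, axis=-1)

mask = ~jnp.eye(m, dtype=bool)       # ignore the zero diagonal
ratio = d2_proj[mask] / d2[mask]
print(ratio.min(), ratio.max())      # stays within [1 - eps, 1 + eps] with high probability
```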

Skeleton decomposition

A very useful representation, closely related to the matrix rank, is the skeleton decomposition. This decomposition explains why and how matrices of low rank can be compressed.

Graphically, $A$ is represented as a product of a few of its columns, a small square matrix and a few of its rows, or in the matrix form

$$ A = C \widehat{A}^{-1} R, $$

where $C$ are some $r=\mathrm{rank}(A)$ columns of $A$, $R$ are some $r$ rows of $A$ and $\widehat{A}$ is the nonsingular $r \times r$ submatrix on their intersection.

Remark

We have not yet formally defined the inverse, so just a reminder: for a nonsingular square matrix $\widehat{A}$, its inverse $\widehat{A}^{-1}$ is the unique matrix satisfying $\widehat{A}\widehat{A}^{-1} = \widehat{A}^{-1}\widehat{A} = I$.

Proof for the skeleton decomposition

Since $\mathrm{rank}(A) = r$, every column $a_i$ of $A$ is a linear combination of the $r$ columns of $C$: $a_i = C x_i$. Restricting this equality to the $r$ selected rows, we get for the $i$-th column $\widehat{r}_i$ of $R$

$$\widehat{r}_i = \widehat{A} x_i \quad \Longrightarrow \quad x_i = \widehat{A}^{-1} \widehat r_i.$$

Thus, $a_i = C\widehat{A}^{-1} \widehat r_i$ for every $i$ and

$$A = [a_1,\dots, a_m] = C\widehat{A}^{-1} R.$$
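A minimal numerical sketch of the skeleton decomposition; the rank-$r$ test matrix and the choice of the first $r$ rows and columns are illustrative assumptions (in general the cross has to be chosen so that $\widehat{A}$ is nonsingular):

```python
import jax
import jax.numpy as jnp

n, m, r = 60, 40, 5
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
A = jax.random.normal(k1, (n, r)) @ jax.random.normal(k2, (r, m))   # exact rank r

# Take the first r rows and columns; for this random matrix the r x r
# submatrix on their intersection is nonsingular with probability 1
C = A[:, :r]                  # n x r columns
R = A[:r, :]                  # r x m rows
A_hat = A[:r, :r]             # r x r submatrix on the intersection

# A = C * A_hat^{-1} * R; solve is used instead of forming the inverse explicitly
A_skeleton = C @ jnp.linalg.solve(A_hat, R)
print(jnp.linalg.norm(A - A_skeleton) / jnp.linalg.norm(A))   # ~ machine precision
```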

A closer look at the skeleton decomposition

$$A = C \widehat{A}^{-1} R,$$

where $C$ is $n \times r$, $R$ is $r \times m$ and $\widehat{A}$ is $r \times r$, or

$$ A = UV, $$

where $U$ and $V$ are not unique, e.g. $U = C \widehat{A}^{-1}$, $V=R$.

In the index form, it is

$$ a_{ij} = \sum_{\alpha=1}^r u_{i \alpha} v_{\alpha j}. $$

For rank 1, we have

$$ a_{ij} = u_i v_j, $$

i.e. it is a separation of indices, and a rank-$r$ matrix is a sum of $r$ rank-$1$ matrices!

Storage

It is interesting to note that for a rank-$r$ matrix

$$A = U V$$

only the factors $U$ and $V$ need to be stored, which gives $(n+m) r$ parameters instead of $nm$, so this representation can be used for compression. We can also compute the matrix-by-vector product $Ax$ much faster:
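For example, $Ax$ can be evaluated as $U(Vx)$, which costs $\mathcal{O}((n+m)r)$ operations instead of $\mathcal{O}(nm)$. A minimal sketch (the shapes and the rank are illustrative):

```python
import jax
import jax.numpy as jnp

n, m, r = 2000, 1500, 10
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
U = jax.random.normal(k1, (n, r))
V = jax.random.normal(k2, (r, m))
x = jax.random.normal(k3, (m,))

y_full = (U @ V) @ x       # forms the full n x m matrix first
y_fast = U @ (V @ x)       # uses only the factors: O((n + m) r) operations
print(jnp.linalg.norm(y_full - y_fast) / jnp.linalg.norm(y_full))   # agree up to rounding
```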

The same works for addition, elementwise multiplication, etc. For addition:

$$ A_1 + A_2 = U_1 V_1 + U_2 V_2 = [U_1|U_2] [V_1^\top|V_2^\top]^\top $$
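In code, the sum is represented by stacking the factors; the resulting rank is at most $r_1 + r_2$ and can be reduced afterwards (e.g. by the SVD discussed below). A small sketch with illustrative sizes:

```python
import jax
import jax.numpy as jnp

k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(1), 4)
U1, V1 = jax.random.normal(k1, (100, 3)), jax.random.normal(k2, (3, 80))
U2, V2 = jax.random.normal(k3, (100, 4)), jax.random.normal(k4, (4, 80))

# Stack the factors: the sum has rank at most 3 + 4 = 7
U = jnp.hstack([U1, U2])
V = jnp.vstack([V1, V2])
print(jnp.linalg.norm(U @ V - (U1 @ V1 + U2 @ V2)))   # zero up to rounding
```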

Computing matrix rank

We can also try to compute the matrix rank using the built-in jnp.linalg.matrix_rank function.
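A sketch of such an experiment (the low-rank test matrix and the size of the perturbation are illustrative choices):

```python
import jax
import jax.numpy as jnp

n, r = 50, 5
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
A = jax.random.normal(k1, (n, r)) @ jax.random.normal(k2, (r, n))   # exact rank 5

print(jnp.linalg.matrix_rank(A))        # 5

# A perturbation that is tiny relative to ||A||
E = 1e-3 * jax.random.normal(k3, (n, n))
print(jnp.linalg.matrix_rank(A + E))    # much larger than 5
```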

So, small perturbations might crucially affect the rank!

Instability of the matrix rank

For any rank-$r$ matrix $A$ with $r < \min(m, n)$ and any $\epsilon > 0$ there is a matrix $B$ whose rank is equal to $\min(m, n)$ and

$$ \Vert A - B \Vert = \epsilon. $$

Q: So, does this mean that numerically matrix rank has no meaning? (I.e., small perturbations lead to full rank!)

A: No. We should find a matrix $B$ such that $\|A-B\| \leq \epsilon$ and $B$ has the minimal possible rank. So we can only compute the rank with a given accuracy $\epsilon$. One of the approaches to compute the matrix rank with given accuracy is the SVD.

Low rank approximation

An important problem in many applications is to find a low-rank approximation of a given matrix with given accuracy $\epsilon$ or given rank $r$.
Examples:

These problems can be solved by SVD.

Singular value decomposition

To compute low-rank approximation, we need to compute singular value decomposition (SVD).

Theorem Any matrix $A\in \mathbb{C}^{n\times m}$ can be written as a product of three matrices:

$$ A = U \Sigma V^*, $$

where $U$ is an $n \times n$ unitary matrix, $V$ is an $m \times m$ unitary matrix, and $\Sigma$ is an $n \times m$ diagonal matrix with non-negative diagonal entries $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_{\min(m,n)} \geq 0$, called the singular values.

Proof

The matrix $A^* A$ is Hermitian and positive semidefinite, so it can be diagonalized by a unitary matrix $V$ with non-negative eigenvalues:

$$ V^* A^* A V = \text{diag}(\sigma_1^2,\dots, \sigma_m^2), \quad \sigma_1\geq \sigma_2\geq \dots \geq \sigma_m \geq 0. $$

Let $\sigma_i = 0$ for $i>r$, where $r$ is some integer.
Let $V_r= [v_1, \dots, v_r]$, $\Sigma_r = \text{diag}(\sigma_1, \dots,\sigma_r)$. Hence

$$ V^*_r A^* A V_r = \Sigma_r^2 \quad \Longrightarrow \quad (\Sigma_r^{-1} V_r^* A^*) (A V_r\Sigma_r^{-1} ) = I. $$

As a result, the matrix $U_r = A V_r\Sigma_r^{-1}$ satisfies $U_r^* U_r = I$ and hence has orthonormal columns.
Let us complete $U_r$ with orthonormal columns that are orthogonal to the columns of $U_r$ and denote the resulting unitary matrix by $U$. Then

$$ AV = U \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix}\quad \Longrightarrow \quad U^* A V = \begin{bmatrix}\Sigma_r & 0 \\ 0 & 0 \end{bmatrix}. $$

Since multiplication by non-singular matrices does not change the rank of $A$, we have $r = \text{rank}(A)$.

Corollary 1: $A = \displaystyle{\sum_{\alpha=1}^r} \sigma_\alpha u_\alpha v_\alpha^*$ or elementwise $a_{ij} = \displaystyle{\sum_{\alpha=1}^r} \sigma_\alpha u_{i\alpha} \overline{v}_{j\alpha}$

Corollary 2:

$$\text{ker}(A) = \mathcal{L}\{v_{r+1},\dots,v_m\}, \quad \text{im}(A) = \mathcal{L}\{u_{1},\dots,u_r\},$$

$$\text{ker}(A^*) = \mathcal{L}\{u_{r+1},\dots,u_n\}, \quad \text{im}(A^*) = \mathcal{L}\{v_{1},\dots,v_r\}.$$

Eckart-Young theorem

The best low-rank approximation can be computed by SVD.

Theorem: Let $r < \text{rank}(A)$ and let $A_r = U_r \Sigma_r V_r^*$ be the truncated SVD that keeps the $r$ largest singular values and the corresponding singular vectors. Then

$$ \min_{\text{rank}(B)=r} \|A - B\|_2 = \|A - A_r\|_2 = \sigma_{r+1}. $$

The same holds for $\|\cdot\|_F$, but $\|A - A_r\|_F = \sqrt{\sigma_{r+1}^2 + \dots + \sigma_{\min (n,m)}^2}$.

Proof

Let $B$ be any matrix with $\text{rank}(B) = r$. Its nullspace has dimension at least $m - r$, so it has a nontrivial intersection with the $(r+1)$-dimensional subspace $\mathcal{L}(v_1,\dots,v_{r+1})$; take a unit vector $z$ from this intersection, so that $Bz = 0$, $\|z\|_2 = 1$ and $v_i^* z = 0$ for $i > r+1$. Then

$$ \|A-B\|_2^2 \geq \|(A-B)z\|_2^2 = \|Az\|_2^2 = \| U\Sigma V^* z\|^2_2= \|\Sigma V^* z\|^2_2 = \sum_{i=1}^{n} \sigma_i^2 (v_i^*z)^2 =\sum_{i=1}^{r+1} \sigma_i^2 (v_i^*z)^2 \geq \sigma_{r+1}^2\sum_{i=1}^{r+1} (v_i^*z)^2 = \sigma_{r+1}^2 $$

as $\sigma_1\geq \dots \geq \sigma_{r+1}$ and $$\sum_{i=1}^{r+1} (v_i^*z)^2 = \|V^*z\|_2^2 = \|z\|_2^2 = 1.$$

Main result on low-rank approximation

Corollary: computation of the best rank-$r$ approximation is equivalent to setting $\sigma_{r+1}= 0, \ldots, \sigma_K = 0$, where $K = \min(m, n)$. The error is

$$ \min_{A_r} \Vert A - A_r \Vert_2 = \sigma_{r+1}, \quad \min_{A_r} \Vert A - A_r \Vert_F = \sqrt{\sigma_{r+1}^2 + \dots + \sigma_{K}^2} $$

which is why it is important to look at the decay of the singular values.
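A minimal numerical check of these formulas on a random test matrix (illustrative only):

```python
import jax
import jax.numpy as jnp

A = jax.random.normal(jax.random.PRNGKey(0), (100, 80))
U, s, Vh = jnp.linalg.svd(A, full_matrices=False)

r = 10
A_r = (U[:, :r] * s[:r]) @ Vh[:r, :]                   # truncated SVD = best rank-r approximation

err_2 = jnp.linalg.svd(A - A_r, compute_uv=False)[0]   # spectral norm of the error
err_F = jnp.linalg.norm(A - A_r, 'fro')
print(err_2, s[r])                                     # both equal sigma_{r+1} (up to rounding)
print(err_F, jnp.sqrt(jnp.sum(s[r:] ** 2)))            # both equal sqrt(sum of trailing sigma_i^2)
```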

Computing SVD

Let us go back to the previous example.
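A sketch of what this computation could look like, reusing the perturbed low-rank matrix from the rank example above: the singular values reveal the numerical rank through a sharp drop after the first $r$ values.

```python
import jax
import jax.numpy as jnp

n, r = 50, 5
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
A = jax.random.normal(k1, (n, r)) @ jax.random.normal(k2, (r, n))
A_noisy = A + 1e-3 * jax.random.normal(k3, (n, n))

s = jnp.linalg.svd(A_noisy, compute_uv=False)
print(s[:8])                         # the first 5 values are large, the rest are at the noise level
print(jnp.sum(s > 1e-2 * s[0]))      # numerical rank for relative accuracy 1e-2: expected 5
```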

Separation of variables for 2D functions

We can use SVD to compute approximations of function-related matrices, i.e. the matrices of the form

$$a_{ij} = f(x_i, y_j),$$

where $f$ is a certain function, and $x_i, \quad i = 1, \ldots, n$ and $y_j, \quad j = 1, \ldots, m$ are some one-dimensional grids.
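For example (a sketch with an assumed smooth function $f(x, y) = 1/(1 + x + y)$ on uniform grids), the singular values of such matrices typically decay very fast:

```python
import jax.numpy as jnp

n, m = 200, 150
x = jnp.linspace(0.0, 1.0, n)
y = jnp.linspace(0.0, 1.0, m)

# a_{ij} = f(x_i, y_j) for a smooth function f (an illustrative choice)
A = 1.0 / (1.0 + x[:, None] + y[None, :])

s = jnp.linalg.svd(A, compute_uv=False)
print(s[:10] / s[0])   # rapid decay: an accurate low-rank approximation exists
```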

Function approximation

And 3d plots...

Singular values of a random Gaussian matrix

What is the singular value decay of a random matrix?
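A sketch that computes the singular values of a square Gaussian random matrix (the size is arbitrary); they decay slowly, so such matrices are poorly approximated by low-rank ones:

```python
import jax
import jax.numpy as jnp

A = jax.random.normal(jax.random.PRNGKey(0), (1000, 1000))
s = jnp.linalg.svd(A, compute_uv=False)

r = 100
# Relative Frobenius error of the best rank-100 approximation
rel_err = jnp.sqrt(jnp.sum(s[r:] ** 2) / jnp.sum(s ** 2))
print(rel_err)   # large (around 0.8): even rank 100 captures little of the matrix
```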

Linear factor analysis & low-rank

Consider a linear factor model,

$$y = Ax, $$

where $y$ is a vector of length $n$, and $x$ is a vector of length $r$.
The data is organized as samples: we observe vectors

$$y_1, \ldots, y_T,$$

but do not know the matrix $A$. Then the factor model can be written as

$$ Y = AX, $$

where $Y$ is $n \times T$, $A$ is $n \times r$ and $X$ is $r \times T$.
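If the data exactly follows the factor model, the rank of $Y$ is at most $r$, and a basis of the column space of $A$ can be recovered from the left singular vectors of $Y$. A sketch under these idealized assumptions (with noise one would truncate at the numerical rank instead):

```python
import jax
import jax.numpy as jnp

n, r, T = 100, 4, 500
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
A = jax.random.normal(k1, (n, r))       # unknown mixing matrix
X = jax.random.normal(k2, (r, T))       # unknown factors
Y = A @ X                               # observed data, samples as columns

U, s, Vh = jnp.linalg.svd(Y, full_matrices=False)
print(jnp.sum(s > 1e-3 * s[0]))         # numerical rank = number of factors (4)
U_r = U[:, :r]                          # orthonormal basis of the column space of A
```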

Applications of SVD

Dense matrix compression

Dense matrices typically require $N^2$ elements to be stored. A rank-$r$ approximation can reduce this number to $\mathcal{O}(Nr)$.

Compression of parameters in fully-connected neural networks

Active Subspaces

$$ f(x) \approx g(W x). $$

How to discover Active Subspaces:

Using SVD:

  1. Choose $m$, the number of Monte Carlo samples. The larger $m$ is, the more accurate the result.
  2. Draw samples $\lbrace x_i \rbrace_{i = 1}^{m}$ from $\mathcal{X}$ (according to some prior probability density function)
  3. For each $x_i$ compute $\nabla f(x_i)$
  4. Compute the SVD of the matrix
$$ G := \dfrac{1}{\sqrt{m}} \begin{bmatrix} \nabla f(x_1) & \nabla f(x_2) & \ldots & \nabla f(x_m) \end{bmatrix} \approx U \Sigma V^\top. $$
  5. Estimate the rank of $G \approx U_r \Sigma_r V_r^\top$. The rank $r$ of the matrix $G$ is the dimensionality of the active subspace.
  6. Low-dimensional vectors are estimated as $x_{\text{AS}} = U_r^\top x$.

For further details, see the book "Active Subspaces: Emerging Ideas in Dimension Reduction for Parameter Studies" (2015) by Paul Constantine.
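A minimal sketch of this recipe for an assumed test function $f(x) = \sin(w^\top x)$, whose active subspace is one-dimensional (spanned by $w$):

```python
import jax
import jax.numpy as jnp

d, m = 10, 200                          # input dimension, number of Monte Carlo samples
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
w = jax.random.normal(k1, (d,))

f = lambda x: jnp.sin(x @ w)            # f varies only along w, so the active subspace is span{w}
xs = jax.random.normal(k2, (m, d))      # samples drawn from the prior

grads = jax.vmap(jax.grad(f))(xs)       # gradients at all samples, shape (m, d)
G = grads.T / jnp.sqrt(m)               # columns are grad f(x_i) / sqrt(m)

U, s, Vh = jnp.linalg.svd(G, full_matrices=False)
print(s[:3])                            # one dominant singular value, so r = 1
x_as = xs @ U[:, :1]                    # low-dimensional coordinates U_r^T x for each sample
```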

Take home message

Next lecture

Questions?