Lecture 10: Randomized linear algebra

Brief recap of the previous lecture

SVD and algorithms for its computation: divide-and-conquer, QR, Jacobi, bisection.

Today's lecture

Today, we will take a brief dive into randomized NLA.

A good read is (https://arxiv.org/pdf/2002.01387.pdf)

Random numbers

All the computations that we considered up to today were deterministic.

However, the complexity can often be reduced by using randomized (stochastic) computations.

Example: randomized matrix multiplication.

Checking matrix equality

We can check whether $AB = C$ in $\mathcal{O}(n^2)$ operations.

How?

Freivalds' algorithm

It checks the identity by multiplying both sides by a random vector $x$: compare $A(Bx)$ with $Cx$!

The complexity is $\mathcal{O}(k n^2)$ for $k$ random vectors, and the probability of failure is at most $\frac{1}{2^k}$.
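A minimal NumPy sketch of the idea (the function name and the number of probes are illustrative): compare $A(Bx)$ with $Cx$ for random $0/1$ vectors $x$, using three matvecs per probe instead of a full matmul.

```python
import numpy as np

def freivalds(A, B, C, k=20, seed=0):
    """Probabilistic check of A @ B == C in O(k n^2) operations."""
    rng = np.random.default_rng(seed)
    for _ in range(k):
        x = rng.integers(0, 2, size=C.shape[1]).astype(float)  # random 0/1 vector
        if not np.allclose(A @ (B @ x), C @ x):  # three matvecs, no matmul
            return False                          # certainly A @ B != C
    return True  # A @ B == C with failure probability <= 2^{-k}

rng = np.random.default_rng(1)
A, B = rng.standard_normal((200, 200)), rng.standard_normal((200, 200))
print(freivalds(A, B, A @ B))         # True
print(freivalds(A, B, A @ B + 1e-3))  # False with probability >= 1 - 2^{-20}
```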

Matrix multiplication

But can we multiply matrices faster using randomization ideas?

Randomized matrix multiplication

Q: how to sample the column-row pairs?

A: generate the probabilities from their norms!

$$ AB \approx \sum_{t=1}^k \frac{1}{kp_{i_t}} A^{(i_t)} B_{(i_t)}, $$

where $A^{(i_t)}$ is a column of $A$ and $B_{(i_t)}$ is a row of $B$.
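A minimal NumPy sketch of this sampling scheme (the function name is illustrative), using the norm-based probabilities $p_i \propto \Vert A^{(i)} \Vert_2 \Vert B_{(i)} \Vert_2$:

```python
import numpy as np

def randomized_matmul(A, B, k, seed=0):
    """Approximate A @ B by k rescaled column-row outer products."""
    rng = np.random.default_rng(seed)
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)  # ||A^{(i)}|| * ||B_{(i)}||
    p /= p.sum()                                               # sampling probabilities
    idx = rng.choice(A.shape[1], size=k, p=p)
    # sum_t A^{(i_t)} B_{(i_t)} / (k p_{i_t}), computed as a single small matmul
    return (A[:, idx] / (k * p[idx])) @ B[idx, :]

rng = np.random.default_rng(1)
A, B = rng.standard_normal((300, 500)), rng.standard_normal((500, 300))
err = np.linalg.norm(A @ B - randomized_matmul(A, B, k=100), "fro")
print(err / np.linalg.norm(A @ B, "fro"))  # relative error decays like 1/sqrt(k)
```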

Approximation error

With the optimal probabilities $p_i = \frac{\Vert A^{(i)} \Vert_2 \Vert B_{(i)} \Vert_2}{\sum_{j=1}^n \Vert A^{(j)} \Vert_2 \Vert B_{(j)} \Vert_2}$, the sketch $CR$ (the sampled, rescaled columns of $A$ times the sampled, rescaled rows of $B$) satisfies

$$ \mathbb{E} [\|AB - CR\|^2_F] = \frac{1}{k} \left(\sum_{i=1}^n \| A^{(i)} \|_2 \| B_{(i)} \|_2\right)^2 - \frac{1}{k}\|AB\|_F^2 $$

Q: what matrices can be used?

Stochastic trace estimator

Many problems can be written in the form of the trace estimation:

$$\mathrm{Tr}(A) = \sum_{i} A_{ii}.$$

Can we compute the trace of the matrix if we only have access to matrix-by-vector products?

Two estimators

The randomized trace estimators are based on the following identity:

$$\mathrm{Tr}(A) = \mathbb{E}_w [w^* A w], \quad \mathbb{E}[ww^*] = I.$$

To estimate the trace, we draw $k$ independent samples $w_1, \dots, w_k$, form the random variables $X_t = w_t^* A w_t$, and average the results: $\overline{X}_k = \frac{1}{k} \sum_{t=1}^k X_t$.

Girard trace estimator: sample $w \sim \mathcal{N}(0, I)$.

Then, $\mathrm{Var}\, \overline{X}_k = \frac{2}{k} \sum_{i, j=1}^n \vert A_{ij} \vert^2 = \frac{2}{k} \Vert A \Vert^2_F$

Hutchinson trace estimator: let $w$ be a Rademacher random vector (i.e., the entries are i.i.d. and equal to $\pm 1$ with probability $\frac{1}{2}$).

It gives the minimal variance among such estimators: for symmetric $A$, $\mathrm{Var}\, \overline{X}_k = \frac{2}{k} \left( \Vert A \Vert^2_F - \sum_{i=1}^n A_{ii}^2 \right)$, since the diagonal contribution $\sum_i A_{ii} w_i^2 = \mathrm{Tr}(A)$ is computed exactly.
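A minimal NumPy sketch of both estimators for a matrix available only through matvecs (names and defaults are illustrative); note that the same samples yield a variance estimate for free, a point we return to below.

```python
import numpy as np

def trace_estimator(matvec, n, k=100, kind="hutchinson", seed=0):
    """Estimate Tr(A) from k matvecs, plus a sample-based error bar."""
    rng = np.random.default_rng(seed)
    if kind == "girard":
        W = rng.standard_normal((n, k))           # w ~ N(0, I)
    else:
        W = rng.choice([-1.0, 1.0], size=(n, k))  # Rademacher entries +-1
    X = np.einsum("ij,ij->j", W, matvec(W))       # X_t = w_t^T A w_t
    return X.mean(), X.std(ddof=1) / np.sqrt(k)   # estimate, standard error

rng = np.random.default_rng(1)
n = 500
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ (rng.uniform(0, 1, n)[:, None] * Q.T)     # SPD test matrix
for kind in ("girard", "hutchinson"):
    est, err = trace_estimator(lambda W: A @ W, n, k=200, kind=kind)
    print(kind, est, "+-", err, "| true:", np.trace(A))
```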

Intdim

The deviation of the trace estimator can be bounded in terms of the intrinsic dimension (intdim) for symmetric positive definite matrices.

It is defined as $\mathrm{intdim}(A) = \frac{\mathrm{Tr}(A)}{\Vert A \Vert_2}$. It is easy to show that

$$1 \leq \mathrm{intdim}(A) \leq r,$$

where $r$ is the rank of $A$.

Then, the probability of the large deviation can be estimated as

$$P( \vert \overline{X}_k - \mathrm{Tr}(A) \vert \geq t \mathrm{Tr}(A)) \leq \frac{2}{k \mathrm{intdim}(A) t^2}$$

Better bounds for SPD matrices

If $A$ is SPD, then

$$P(\overline{X}_k \geq \tau \mathrm{Tr}(A) ) \leq \exp\left(-\frac{k}{2}\, \mathrm{intdim}(A) \left(\sqrt{\tau} - 1\right)^2\right), \quad \tau > 1. $$

Similar inequality holds for the lower bound.

This estimate is much better: the failure probability decays exponentially in $k$ and $\mathrm{intdim}(A)$, rather than polynomially.

An interesting (and often overlooked) property of a stochastic estimator is that it comes with a stochastic variance estimate (computed from the same samples!)

Warning: we still need $\mathcal{O}(\varepsilon^{-2})$ samples to reach accuracy $\varepsilon$ when using independent samples.

Distances between languages (original paper)

Where do stochastic methods also help?

Randomized SVD (Halko et al., 2011)

$$ A \approx U\Sigma V^\top, $$

where $A$ is of size $m \times n$, $U$ is of size $m \times k$ and $V$ is of size $n \times k$.

The key step is a randomized approximation of a basis of the column space of $A$: find a matrix $Q$ with orthonormal columns such that

$$A \approx Q Q^{\top}A. $$
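A minimal NumPy sketch of the basic scheme (names and the oversampling default are illustrative): form $Y = AG$ for a Gaussian $G$, orthogonalize it, and recover the SVD from the small matrix $Q^{\top} A$.

```python
import numpy as np

def randomized_svd(A, k, p=10, seed=0):
    """Basic randomized SVD: range finder + exact SVD of the small Q^T A."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((A.shape[1], k + p))   # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ G)                     # orthonormal basis: A ~ Q Q^T A
    U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 50)) @ rng.standard_normal((50, 1000))  # rank 50
U, s, Vt = randomized_svd(A, k=50)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))  # ~ machine precision
```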

Convergence theorem

The averaged error of the presented algorithm, where $k$ is the target rank and $p$ is the oversampling parameter, is the following:

$$ \mathbb{E}\|A - QQ^{\top}A \|_F \leq \left( 1 + \frac{k}{p-1} \right)^{1/2}\left( \sum_{j=k+1}^{\min(m, n)} \sigma^2_j \right)^{1/2} $$

$$ \mathbb{E}\|A - QQ^{\top}A \|_2 \leq \left( 1 + \sqrt{\frac{k}{p-1}} \right)\sigma_{k+1} + \frac{e\sqrt{k+p}}{p}\left( \sum_{j=k+1}^{\min(m, n)} \sigma^2_j \right)^{1/2} $$

The expectation is taken w.r.t. the random matrix $G$ generated in the method described above.

Compare these upper bounds with the Eckart-Young theorem. Are these bounds good?

Accuracy enhanced randomized SVD

$$ Y = (AA^{\top})^qAG \qquad Q, R = \mathtt{qr}(Y) $$

Loss of accuracy due to rounding errors

Q: how can we combat this issue?

A: sequential orthogonalization!
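A minimal sketch of the power scheme with sequential orthogonalization (names are illustrative): a QR factorization is done after every multiplication by $A$ or $A^{\top}$, so $(AA^{\top})^q A G$ is never formed explicitly.

```python
import numpy as np

def range_finder_power(A, k, p=10, q=2, seed=0):
    """Basis for (A A^T)^q A G with a QR after every multiplication by A or A^T,
    so that small singular directions are not lost to rounding errors."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(A @ rng.standard_normal((A.shape[1], k + p)))
    for _ in range(q):
        W, _ = np.linalg.qr(A.T @ Q)   # re-orthogonalize intermediate factors
        Q, _ = np.linalg.qr(A @ W)
    return Q                           # use as before: A ~ Q Q^T A
```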

Convergence theorem

The method presented above provides the following upper bound:

$$ \mathbb{E}\|A - QQ^{\top}A \|_2 \leq \left[\left( 1 + \sqrt{\frac{k}{p-1}} \right)\sigma^{2q+1}_{k+1} + \frac{e\sqrt{k+p}}{p}\left( \sum_{j=k+1}^{\min(m, n)} \sigma^{2(2q+1)}_j \right)^{1/2}\right]^{1/(2q+1)} $$

Consider the worst case, where no low-rank structure exists in the given matrix.

Q: what is the degree of suboptimality w.r.t. the Eckart-Young theorem?

Summary on randomized SVD

Kaczmarz method to solve linear systems

Take a single equation of the system,

$$a^{\top}_i x = f_i,$$

and given an approximation $x_k$ try to find $x_{k+1}$ as

$$x_{k+1} = \arg \min_x \frac12 \Vert x - x_k \Vert^2_2, \quad \mbox{s.t.} \quad a^{\top}_i x = f_i.$$

This projection onto the hyperplane of the $i$-th equation has the closed-form solution

$$x_{k+1} = x_k - \frac{(a_i, x_k) - f_i}{(a_i, a_i)} a_i. $$
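A minimal NumPy sketch of the resulting randomized iteration (names are illustrative), with rows sampled proportionally to $\Vert a_i \Vert^2_2$ as in the convergence theorem below:

```python
import numpy as np

def randomized_kaczmarz(A, f, n_iter=20000, seed=0):
    """Project x onto the hyperplane of one random row per step;
    rows are sampled with probabilities ||a_i||_2^2 / ||A||_F^2."""
    rng = np.random.default_rng(seed)
    row_sq = np.einsum("ij,ij->i", A, A)            # squared row norms
    rows = rng.choice(A.shape[0], size=n_iter, p=row_sq / row_sq.sum())
    x = np.zeros(A.shape[1])
    for i in rows:
        x -= (A[i] @ x - f[i]) / row_sq[i] * A[i]   # projection step
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 100))
x_true = rng.standard_normal(100)
print(np.linalg.norm(randomized_kaczmarz(A, A @ x_true) - x_true))
```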

Convergence theorem

If row $i$ is sampled with probability $p_i = \frac{\Vert a_i \Vert^2_2}{\Vert A \Vert^2_F}$, then

$$ \mathbb{E}[\|x_{k+1} - x^*\|^2_2] \leq \left(1 - \frac{1}{\kappa^2_F(A)}\right) \mathbb{E}[\|x_{k} - x^*\|^2_2], $$

where $\kappa_F(A) = \frac{\| A \|_F}{\sigma_{\min}(A)}$ and $\sigma_{\min}(A)$ is the minimal non-zero singular value of $A$. This result was presented in (Strohmer and Vershynin, 2009).

For an inconsistent (noisy) system, the same iteration converges only up to a horizon determined by the residual:

$$ \mathbb{E}[\|x_{k+1} - x^*\|^2_2] \leq \left(1 - \frac{1}{\kappa^2_F(A)}\right) \mathbb{E}[\|x_{k} - x^*\|^2_2] + \frac{\|r^*\|_2^2}{\| A \|^2_F}, $$

where $r^* = Ax^* - f$ is the residual at the least-squares solution $x^*$.

Inconsistent overdetermined linear system

Here $a_{:, j}$ denotes the $j$-th column of $A$ and $a_{i, :}$ denotes the $i$-th row of $A$

Sampling and sketching

Coherence

The key quantity here is the coherence of the matrix.

Let $A$ be $n \times r$ and $U$ be an $n \times r$ matrix with orthonormal columns that form a basis of the column space of $A$.

Then, coherence is defined as

$$\mu(A) = \max_{1 \leq i \leq n} \Vert U_{i, *} \Vert^2_2$$

Coherence always satisfies $\frac{r}{n} \leq \mu(A) \leq 1$; it has nothing to do with the condition number.
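A minimal sketch of computing the coherence from a QR factorization (the function name is illustrative):

```python
import numpy as np

def coherence(A):
    """mu(A) = max_i ||U_{i,*}||_2^2 for an orthonormal basis U of range(A)."""
    U, _ = np.linalg.qr(A)                       # n x r, orthonormal columns
    return np.max(np.einsum("ij,ij->i", U, U))   # largest squared row norm

rng = np.random.default_rng(0)
n, r = 1000, 20
print(coherence(rng.standard_normal((n, r))))  # incoherent: close to r/n = 0.02
print(coherence(np.eye(n)[:, :r]))             # maximally coherent: exactly 1.0
```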

What does it mean?

Coherence

Small coherence means that sampling rows uniformly gives a good preconditioner (preconditioners, and why they are important, will be covered later in the course)

One can compute the QR factorization $S A = QR$ of the sampled (sketched) matrix, and look at the condition number of $AR^{-1}$.
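A minimal sketch of this idea (sizes are illustrative, and a Gaussian sketch $S$ stands in for row sampling):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10000, 50
A = rng.standard_normal((m, n)) * np.logspace(0, 6, n)  # ill-conditioned columns
S = rng.standard_normal((4 * n, m)) / np.sqrt(4 * n)    # Gaussian sketching matrix
_, R = np.linalg.qr(S @ A)                              # R from the small sketch
print(np.linalg.cond(A))                     # ~ 1e6
print(np.linalg.cond(A @ np.linalg.inv(R)))  # O(1): R is a good preconditioner
```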

Summary on randomized methods in solving linear systems

Summary on randomized matmul

Next lecture

Questions?