Lecture 17: Tensors and tensor decompositions

Recap of the previous lecture

Today's lecture

Tensors

By a tensor we mean a multidimensional array:

$$ A(i_1, \dots, i_d), \quad 1\leq i_k\leq n_k, $$

where $d$ is called the dimension and $n_k$ the mode sizes. This is the standard definition in the applied mathematics community. For details see [1], [2], [3].

The picture is taken from this presentation.

More formal definition

A tensor is a multilinear form. When you fix a basis, you get a $d$-dimensional table of its coefficients.

Curse of dimensionality

The problem with multidimensional data is that the number of parameters grows exponentially with $d$:

$$ \text{storage} = n^d. $$

For instance, for $n=2$ and $d=500$: $$ n^d = 2^{500} \approx 3\cdot 10^{150} \gg 10^{83} \approx \text{the number of atoms in the Universe}. $$

Why do we care? After all, it seems that we live in a 3D world :)

Applications

Quantum chemistry

The stationary Schrödinger equation for a system with $N_{el}$ electrons:

$$ \hat H \Psi = E \Psi, $$

where

$$ \Psi = \Psi(\{{\bf r_1},\sigma_1\},\dots, \{{\bf r_{N_{el}}},\sigma_{N_{el}}\}) $$

a function of $3N_{el}$ spatial variables and $N_{el}$ spin variables.

Uncertainty quantification

Example: oil reservoir modeling. The model may depend on parameters $p_1,\dots,p_d$ (such as experimentally measured porosity or temperature) that carry uncertainty:

$$ u = u(t,{\bf r},\,{p_1,\dots,p_d}) $$

And many more

Working with many dimensions

How to work with high-dimensional functions?

Tensor decompositions

2D

The skeleton decomposition $$ A = UV^T, $$ or elementwise $$ a_{ij} = \sum_{\alpha=1}^r u_{i\alpha} v_{j\alpha}, $$ leads us to the idea of separation of variables.
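
As a quick illustration, one way to obtain such a rank-$r$ factorization $A \approx UV^T$ is the truncated SVD; a minimal NumPy sketch:

```python
import numpy as np

# A low-rank factorization A = U V^T obtained from the truncated SVD
# (one convenient way to get a separated representation; the skeleton
# decomposition proper uses r selected columns and rows instead).
n, r = 100, 5
rng = np.random.default_rng(0)
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # exactly rank r

Us, s, Vt = np.linalg.svd(A, full_matrices=False)
U = Us[:, :r] * s[:r]        # absorb the singular values into U
V = Vt[:r].T
print(np.linalg.norm(A - U @ V.T))                   # ~0: A = U V^T
# elementwise: a_ij = sum_alpha u_{i,alpha} v_{j,alpha}
print(np.allclose(A, np.einsum('ia,ja->ij', U, V)))
```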

Properties:

Canonical decomposition

The most straightforward way to generalize separation of variables to many dimensions is the canonical decomposition (CP/CANDECOMP/PARAFAC):

$$ a_{ijk} = \sum_{\alpha=1}^r u_{i\alpha} v_{j\alpha} w_{k\alpha} $$

The minimal possible $r$ is called the canonical rank, and the matrices $U$, $V$ and $W$ are called the canonical factors. This decomposition was proposed by Hitchcock in 1927, link.

Properties:

Alternating Least Squares algorithm

  1. Initialize random $U,V,W$
  2. fix $V,W$, solve least squares for $U$
  3. fix $U,W$, solve least squares for $V$
  4. fix $U,V$, solve least squares for $W$
  5. go to 2.
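
A minimal NumPy sketch of this loop for a third-order tensor is given below; each step solves a linear least-squares problem for one factor via the normal equations, using the identity $(V\odot W)^T(V\odot W) = (V^TV)\circ(W^TW)$ for the Khatri-Rao product $\odot$ and the Hadamard product $\circ$.

```python
import numpy as np

def unfold(T, mode):
    """Mode-`mode` unfolding: rows indexed by the chosen mode, the rest flattened in C order."""
    return np.reshape(np.moveaxis(T, mode, 0), (T.shape[mode], -1))

def khatri_rao(B, C):
    """Column-wise Kronecker (Khatri-Rao) product of B (J x r) and C (K x r)."""
    r = B.shape[1]
    return (B[:, None, :] * C[None, :, :]).reshape(-1, r)

def cp_als(A, r, n_iter=100):
    """ALS for a rank-r CP approximation of a third-order tensor A."""
    rng = np.random.default_rng(0)
    I, J, K = A.shape
    U, V, W = (rng.standard_normal((n, r)) for n in (I, J, K))
    for _ in range(n_iter):
        # each update is the least-squares solution for one factor with the others fixed
        U = unfold(A, 0) @ khatri_rao(V, W) @ np.linalg.pinv((V.T @ V) * (W.T @ W))
        V = unfold(A, 1) @ khatri_rao(U, W) @ np.linalg.pinv((U.T @ U) * (W.T @ W))
        W = unfold(A, 2) @ khatri_rao(U, V) @ np.linalg.pinv((U.T @ U) * (V.T @ V))
    return U, V, W

# sanity check on an exactly rank-2 tensor
rng = np.random.default_rng(1)
U0, V0, W0 = rng.random((10, 2)), rng.random((11, 2)), rng.random((12, 2))
A = np.einsum('ia,ja,ka->ijk', U0, V0, W0)
U, V, W = cp_als(A, r=2)
print(np.linalg.norm(A - np.einsum('ia,ja,ka->ijk', U, V, W)) / np.linalg.norm(A))
```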

DNN compression (Lebedev et al., 2015)

Tucker decomposition

The next attempt is the decomposition proposed by Tucker (1963) in Psychometrika:

$$ a_{ijk} = \sum_{\alpha_1,\alpha_2,\alpha_3=1}^{r_1,r_2,r_3}g_{\alpha_1\alpha_2\alpha_3} u_{i\alpha_1} v_{j\alpha_2} w_{k\alpha_3}. $$

Here we have several different ranks. The minimal possible $r_1,r_2,r_3$ are called the Tucker ranks.
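
One standard way to compute a quasi-optimal Tucker approximation is the truncated higher-order SVD (HOSVD): take the leading left singular vectors of each unfolding as the factors and project the tensor onto them. A minimal sketch for the third-order case, reusing `unfold` and the test tensor `A` from the ALS sketch above:

```python
def hosvd(A, ranks):
    """Truncated HOSVD: factors from SVDs of the mode unfoldings, core by projection."""
    factors = []
    for mode, r in enumerate(ranks):
        Um, _, _ = np.linalg.svd(unfold(A, mode), full_matrices=False)
        factors.append(Um[:, :r])
    core = np.einsum('ijk,ia,jb,kc->abc', A, *factors)
    return core, factors

core, (U, V, W) = hosvd(A, ranks=(2, 2, 2))
A_hat = np.einsum('abc,ia,jb,kc->ijk', core, U, V, W)
# exact here, since the test tensor A above has mode ranks (2, 2, 2)
print(np.linalg.norm(A - A_hat) / np.linalg.norm(A))
```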

Properties:

Application in recommender systems (Frolov, Oseledets 2016)

CP and Tucker decompositions implementations
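
For example, both decompositions are available in the TensorLy package; a possible usage sketch is below (exact argument names and return types may differ between TensorLy versions, so check its documentation):

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac, tucker

T = tl.tensor(np.random.rand(10, 11, 12))

cp = parafac(T, rank=3)                      # CP decomposition
T_cp = tl.cp_to_tensor(cp)

core, factors = tucker(T, rank=[3, 3, 3])    # Tucker decomposition
T_tucker = tl.tucker_to_tensor((core, factors))

print(np.linalg.norm(T - T_cp) / np.linalg.norm(T),
      np.linalg.norm(T - T_tucker) / np.linalg.norm(T))
```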

Tensor Train decomposition

The Tensor Train (TT) decomposition (Oseledets and Tyrtyshnikov, 2009; Oseledets, 2011) is both stable and has a number of parameters that grows only linearly in $d$:

$$ a_{i_1 i_2 \dots i_d} = \sum_{\alpha_1,\dots,\alpha_{d-1}} g_{i_1\alpha_1} g_{\alpha_1 i_2\alpha_2}\dots g_{\alpha_{d-2} i_{d-1}\alpha_{d-1}} g_{\alpha_{d-1} i_{d}} $$

or, in matrix form,

$$ a_{i_1 i_2 \dots i_d} = G_1 (i_1)G_2 (i_2)\dots G_d(i_d) $$

Example: $$a_{i_1\dots i_d} = i_1 + \dots + i_d.$$ The canonical rank is $d$, while the TT-ranks are $2$: $$ i_1 + \dots + i_d = \begin{pmatrix} i_1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ i_2 & 1 \end{pmatrix} \dots \begin{pmatrix} 1 & 0 \\ i_{d-1} & 1 \end{pmatrix} \begin{pmatrix} 1 \\ i_d \end{pmatrix} $$
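
The TT cores can be computed by a sequence of truncated SVDs of suitably reshaped unfoldings (the TT-SVD algorithm). A minimal NumPy sketch, used here to check the TT-ranks of the example above:

```python
import numpy as np

def tt_svd(A, eps=1e-10):
    """TT-SVD: peel off one TT core at a time by a truncated SVD."""
    d, n = A.ndim, A.shape
    cores, C, r_prev = [], np.asarray(A, dtype=float), 1
    for k in range(d - 1):
        C = C.reshape(r_prev * n[k], -1)
        Uk, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = max(1, int(np.sum(s > eps * s[0])))        # drop negligible singular values
        cores.append(Uk[:, :r].reshape(r_prev, n[k], r))
        C, r_prev = s[:r, None] * Vt[:r], r
    cores.append(C.reshape(r_prev, n[-1], 1))
    return cores

# check the example a_{i_1...i_d} = i_1 + ... + i_d with i_k in {0, 1}
d = 10
A = np.indices((2,) * d).sum(axis=0).astype(float)
print([core.shape[0] for core in tt_svd(A)[1:]])       # internal TT-ranks: all equal to 2
```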

Implementations

Using TT in Riemannian optimization (the example is from the t3f examples)

\begin{equation*} \begin{aligned} & \underset{X}{\text{minimize}} & & \frac{1}{2}\|X - A\|_F^2 \\ & \text{subject to} & & \mathrm{tt\_rank}(X) = r \end{aligned} \end{equation*}

Riemannian gradient descent

$$\hat{x}_{k+1} = x_{k} - \alpha P_{T_{x_k}\mathcal{M}} \nabla F(x_k), \qquad x_{k+1} = \mathcal{R}(\hat{x}_{k+1}),$$

where $P_{T_{x_k}\mathcal{M}}$ is the projection onto the tangent space of $\mathcal{M}$ at the point $x_k$, $\mathcal{R}$ is a retraction (an operation that maps points back onto the manifold), and $\alpha$ is the step size (learning rate).
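
For intuition, here is a runnable toy analogue in the simplest case of the fixed-rank matrix manifold, with truncated SVD playing the role of the retraction; for brevity the tangent-space projection is omitted, so this is plain projected gradient descent rather than the full Riemannian scheme from the t3f example:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 40))
r, alpha = 5, 0.5

def retract(X, r):
    """Truncated SVD: map a point back onto the manifold of rank-r matrices."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

X = retract(rng.standard_normal(A.shape), r)    # random starting point on the manifold
for _ in range(50):
    grad = X - A                                # Euclidean gradient of 0.5*||X - A||_F^2
    X = retract(X - alpha * grad, r)            # gradient step followed by retraction
# the error approaches the best rank-r error (the tail singular values of A)
print(np.linalg.norm(X - A), np.linalg.norm(A - retract(A, r)))
```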

It is instructive to compare the obtained result with the quasi-optimum delivered by the TT-rounding procedure.

We see that the value is slightly larger than the exact minimum, but TT-rounding is faster and cheaper to compute, so it is often used in practice.

Exponential machines (Novikov et al., 2017)

The usual linear model

$$ \hat{y}_l(x) = \langle w, x \rangle + b$$

is extended to a model that includes all interactions between the features:

$$\hat{y}_{expm}(x) = \sum_{i_1 = 0}^1 \ldots \sum_{i_d = 0}^1 W_{i_1, \ldots, i_d} \prod_{k=1}^d x_k^{i_k}$$
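
The weight tensor $W$ has $2^d$ entries, but in exponential machines it is stored in the TT format, and then the exponentially large sum collapses into a product of small matrices (one factor per feature), since the summation over each $i_k$ can be carried out inside the corresponding core. A minimal sketch, assuming cores of shape $(r_{k-1}, 2, r_k)$ with $r_0 = r_d = 1$:

```python
import numpy as np

def expm_predict(x, cores):
    """Evaluate sum_{i_1..i_d} W_{i_1..i_d} prod_k x_k^{i_k}
    for W given by TT cores of shape (r_{k-1}, 2, r_k)."""
    M = np.ones((1, 1))
    for xk, G in zip(x, cores):
        # sum over i_k in {0, 1}: x_k^0 * G[:, 0, :] + x_k^1 * G[:, 1, :]
        M = M @ (G[:, 0, :] + xk * G[:, 1, :])
    return M[0, 0]

# tiny random example: d = 4 features, TT-ranks 3
rng = np.random.default_rng(0)
shapes = [(1, 2, 3), (3, 2, 3), (3, 2, 3), (3, 2, 1)]
cores = [rng.standard_normal(s) for s in shapes]
print(expm_predict(rng.standard_normal(4), cores))
```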

Quantized Tensor Train

Consider a 1D array $a_k = f(x_k)$, $k=0,\dots,2^d-1$, where $f$ is some 1D function evaluated at grid points $x_k$.

Let $$k = 2^{d-1} i_1 + 2^{d-2} i_2 + \dots + 2^0 i_{d}, \quad i_1,\dots,i_d \in \{0,1\}, $$ be the binary representation of $k$; then

$$ a_k = a_{2^{d-1} i_1 + 2^{d-2} i_2 + \dots + 2^0 i_{d}} \equiv \tilde a_{i_1,\dots,i_d}, $$

where $\tilde a$ is nothing but a reshaping of the array $a$ into a $2\times 2\times \dots \times 2$ tensor. The TT decomposition of $\tilde a$ is called the Quantized Tensor Train (QTT) decomposition.
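
For instance, smooth functions sampled on $2^d$ points typically have very small QTT ranks; a quick check, reusing the `tt_svd` sketch from above:

```python
d = 16
x = np.linspace(0, 1, 2 ** d)
A = np.sin(100 * x).reshape((2,) * d)    # 2^16 samples, reshaped by the binary digits of k
print([core.shape[0] for core in tt_svd(A, eps=1e-8)[1:]])   # QTT ranks: all 2 for a sine
```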

An interesting fact is that the QTT decomposition is related to wavelets; for more details see here.

It contains only $\mathcal{O}(r^2 \log n)$ elements!

Cross approximation method

Tensor networks

Summary

Next week