Lecture 3: Matvecs and matmuls, memory hierarchy, Strassen algorithm

Recap of the previous lectures

Examples of peak performance

Flops –– floating point operations per second.

Giga = $2^{30} \approx 10^9$,
Tera = $2^{40} \approx 10^{12}$,
Peta = $2^{50} \approx 10^{15}$,
Exa = $2^{60} \approx 10^{18}$

What is the peak perfomance of:

  1. Modern CPU
  2. Modern GPU
  3. Largest supercomputer of the world?

Clock frequency of CPU vs. performance in flops

FLOPS = sockets (cores per socket) (number of clock cycles per second) * (number of floating point operations per cycle).

  1. Modern CPU (Intel Core i7) –– 400 Gflops
  2. Modern GPU (Nvidia Quadro RTX 8000) –– 16.3 Tflops single precision
  3. Largest supercomputer in the world –– 513.85 Pflops –– peak performanse

Matrix-by-vector multiplication (matvec)

Multiplication of an $n\times n$ matrix $A$ by a vector $x$ of size $n\times 1$ ($y=Ax$):

$$ y_{i} = \sum_{i=1}^n a_{ij} x_j $$

requires $n^2$ mutliplications and $n(n-1)$ additions. Thus, the overall complexity is $2n^2 - n =$ $\mathcal{O}(n^2)$

How bad is $\mathcal{O}(n^2)$?

\begin{align*} \frac{(10^{11})^2 \text{ operations}}{10^{16} \text{ flops}} = 10^6 \text{ sec} \approx 11.5 \text{ days} \end{align*}

for one time step. If we could multiply it with $\mathcal{O}(n)$ complexity, we would get

\begin{align*} \frac{10^{11} \text{ operations}}{10^{16} \text{ flops}} = 10^{-5} \text{ sec}. \end{align*}

Here is the YouTube video that illustrates collision of two galaxisies which was modelled by $\mathcal{O}(n \log n)$ algorithm:

Can we beat $\mathcal{O}(n^2)$?

Matrix-by-matrix product

Consider composition of two linear operators:

  1. $y = Bx$
  2. $z = Ay$

Then, $z = Ay = A B x = C x$, where $C$ is the matrix-by-matrix product.

Matrix-by-matrix product (MM): classics

Definition. A product of an $n \times k$ matrix $A$ and a $k \times m$ matrix $B$ is a $n \times m$ matrix $C$ with the elements
$$ c_{ij} = \sum_{s=1}^k a_{is} b_{sj}, \quad i = 1, \ldots, n, \quad j = 1, \ldots, m $$

For $m=k=n$ complexity of a naïve algorithm is $2n^3 - n^2 =$ $\mathcal{O}(n^3)$.

Discussion of MM

Efficient implementation for MM

Q1: Is it easy to multiply a matrix by a matrix in the most efficient way?

Answer: no, it is not easy

If you want it as fast as possible, using the computers that are at hand.

Demo

Let us do a short demo and compare a np.dot() procedure which in my case uses MKL with a hand-written matrix-by-matrix routine in Python and also its numba version.

Then we just compare computational times.

Guess the answer.

Is this answer correct for any dimensions of matrices?

Why is naïve implementation slow?

It is slow due to two issues:

Memory architecture

Making algorithms more computationally intensive

**Implementation in NLA**: use block version of algorithms.

This approach is a core of BLAS (Basic Linear Algebra Subroutines), written in Fortran many years ago, and still rules the computational world.

Split the matrix into blocks! For illustration consider splitting in $2 \times 2$ block matrix:

$$ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \quad B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$$

Then,

$$AB = \begin{bmatrix}A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22}\end{bmatrix}.$$

If $A_{11}, B_{11}$ and their product fit into the cache memory (which is 12 Mb (L3) for the recent Intel Chip), then we load them only once into the memory.

BLAS

BLAS has three levels:

  1. BLAS-1, operations like $c = a + b$
  2. BLAS-2, operations like matrix-by-vector product
  3. BLAS-3, matrix-by-matrix product

What is the principal differences between them?

The main difference is the number of operations vs. the number of input data!

  1. BLAS-1: $\mathcal{O}(n)$ data, $\mathcal{O}(n)$ operations
  2. BLAS-2: $\mathcal{O}(n^2)$ data, $\mathcal{O}(n^2)$ operations
  3. BLAS-3: $\mathcal{O}(n^2)$ data, $\mathcal{O}(n^3)$ operations

Why BLAS is so important and actual?

  1. The state-of-the-art implementation of the basic linear algebra operations
  2. Provides standard names for operations in any new implementations (e.g. ATLAS, OpenBLAS, MKL). You can call matrix-by-matrix multiplication function (GEMM), link your code with any BLAS implementation and it will work correctly
  3. Formulate new algorithms in terms of BLAS operations
  4. There are wrappers for the most popular languages
  1. ATLAS - Automatic Tuned Linear Algebra Software. It automatically adapts to a particular system architechture.
  2. LAPACK - Linear Algebra Package. It provides high-level linear algebra operations (e.g. matrix factorizations), which are based on calls of BLAS subroutines.
  3. Intel MKL - Math Kernel Library. It provides re-implementation of BLAS and LAPACK, optimized for Intel processors. Available in Anaconda Python distribution:

    conda install mkl

    MATLAB uses Intel MKL by default.

  4. OpenBLAS is an optimized BLAS library based on GotoBLAS.

  5. PyTorch supports some calls from BLAS and LAPACK

  6. For GPU it was implemented special cuBLAS.

For comparison of OpenBLAS and Intel MKL, see this review

Faster algorithms for matrix multiplication

Recall that matrix-matrix multiplication costs $\mathcal{O}(n^3)$ operations. However, storage is $\mathcal{O}(n^2)$.

Question: is it possible to reduce number operations down to $\mathcal{O}(n^2)$?

Answer: a quest for $\mathcal{O}(n^2)$ matrix-by-matrix multiplication algorithm is not yet done.

Consider Strassen in more details.

Naïve multiplication

Let $A$ and $B$ be two $2\times 2$ matrices. Naïve multiplication $C = AB$

$$ \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{21} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{21} + a_{22}b_{22} \end{bmatrix} $$

contains $8$ multiplications and $4$ additions.

Strassen algorithm

In the work Gaussian elimination is not optimal (1969) Strassen found that one can calculate $C$ using 18 additions and only 7 multiplications: $$ \begin{split} c_{11} &= f_1 + f_4 - f_5 + f_7, \\ c_{12} &= f_3 + f_5, \\ c_{21} &= f_2 + f_4, \\ c_{22} &= f_1 - f_2 + f_3 + f_6, \end{split} $$ where $$ \begin{split} f_1 &= (a_{11} + a_{22}) (b_{11} + b_{22}), \\ f_2 &= (a_{21} + a_{22}) b_{11}, \\ f_3 &= a_{11} (b_{12} - b_{22}), \\ f_4 &= a_{22} (b_{21} - b_{11}), \\ f_5 &= (a_{11} + a_{12}) b_{22}, \\ f_6 &= (a_{21} - a_{11}) (b_{11} + b_{12}), \\ f_7 &= (a_{12} - a_{22}) (b_{21} + b_{22}). \end{split} $$

Fortunately, these formulas hold even if $a_{ij}$ and $b_{ij}$, $i,j=1,2$ are block matrices.

Thus, Strassen algorithm looks as follows.

This leads us again to the divide and conquer idea.

Complexity of the Strassen algorithm

Number of multiplications

Calculation of number of multiplications is a trivial task. Let us denote by $M(n)$ number of multiplications used to multiply 2 matrices of sizes $n\times n$ using the divide and conquer concept. Then for naïve algorithm we have number of multiplications

$$ M_\text{naive}(n) = 8 M_\text{naive}\left(\frac{n}{2} \right) = 8^2 M_\text{naive}\left(\frac{n}{4} \right) = \dots = 8^{d-1} M(1) = 8^{d} = 8^{\log_2 n} = n^{\log_2 8} = n^3 $$

So, even when using divide and coquer idea we can not be better than $n^3$.

Let us calculate number of multiplications for the Strassen algorithm:

$$ M_\text{strassen}(n) = 7 M_\text{strassen}\left(\frac{n}{2} \right) = 7^2 M_\text{strassen}\left(\frac{n}{4} \right) = \dots = 7^{d-1} M(1) = 7^{d} = 7^{\log_2 n} = n^{\log_2 7} $$

Number of additions

There is no point to estimate number of addtitions $A(n)$ for naive algorithm, as we already got $n^3$ multiplications.
For the Strassen algorithm we have:

$$ A_\text{strassen}(n) = 7 A_\text{strassen}\left( \frac{n}{2} \right) + 18 \left( \frac{n}{2} \right)^2 $$

since on the first level we have to add $\frac{n}{2}\times \frac{n}{2}$ matrices 18 times and then go deeper for each of the 7 multiplications. Thus,

$$ \begin{split} A_\text{strassen}(n) =& 7 A_\text{strassen}\left( \frac{n}{2} \right) + 18 \left( \frac{n}{2} \right)^2 = 7 \left(7 A_\text{strassen}\left( \frac{n}{4} \right) + 18 \left( \frac{n}{4} \right)^2 \right) + 18 \left( \frac{n}{2} \right)^2 = 7^2 A_\text{strassen}\left( \frac{n}{4} \right) + 7\cdot 18 \left( \frac{n}{4} \right)^2 + 18 \left( \frac{n}{2} \right)^2 = \\ =& \dots = 18 \sum_{k=1}^d 7^{k-1} \left( \frac{n}{2^k} \right)^2 = \frac{18}{4} n^2 \sum_{k=1}^d \left(\frac{7}{4} \right)^{k-1} = \frac{18}{4} n^2 \frac{\left(\frac{7}{4} \right)^d - 1}{\frac{7}{4} - 1} = 6 n^2 \left( \left(\frac{7}{4} \right)^d - 1\right) \leqslant 6 n^2 \left(\frac{7}{4} \right)^d = 6 n^{\log_2 7} \end{split} $$

</font>

(since $4^d = n^2$ and $7^d = n^{\log_2 7}$).

Asymptotic behavior of $A(n)$ could be also found from the master theorem.