Lecture 9: From dense to sparse linear algebra

Recap of the previous lectures

Large scale dense matrices

Distributed memory and MPI

Example: matrix-by-vector product

1D blocking scheme

Total time of computing matvec with 1D blocking

2D blocking scheme

Total time of computing matvec with 2D blocking

Packages supporting distributed storage

In Python, you can use mpi4py for parallel programming of your algorithms; a minimal sketch of a 1D-blocked matvec is shown below.
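The following is a minimal sketch of a matrix-by-vector product with 1D (row) blocking in mpi4py; the size n, the random data, and the assumption that n is divisible by the number of processes are all illustrative.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 1024                            # assume n % size == 0
rows = n // size                    # each process owns a block of rows
A_local = np.random.rand(rows, n)   # local block of the matrix

x = np.empty(n)
if rank == 0:
    x = np.random.rand(n)
comm.Bcast(x, root=0)               # every process needs the full vector

y_local = A_local @ x               # local piece of the product
y = np.empty(n) if rank == 0 else None
comm.Gather(y_local, y, root=0)     # assemble the full result on rank 0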

Summary on large unstructured matrix processing

Sparse matrices intro

The question is whether we can do the same operations efficiently with sparse matrices.

Plan for the next part of the lecture

Now we will talk about sparse matrices: where they arise, how we store them, and how we operate on them.

Applications of sparse matrices

Sparse matrices arise in:

Sparse matrices are ubiquitous in PDE

The simplest partial differential equation (PDE) here is the Laplace (more precisely, with a nonzero right-hand side $f$, the Poisson) equation:

$$ \Delta T = \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} = f(x,y), \quad x,y\in \Omega\equiv[0,1]^2, $$

$$ T\big|_{\partial\Omega} = 0. $$

Discretization

$$\frac{\partial^2 T}{\partial x^2} = \frac{T(x+h) + T(x-h) - 2T(x)}{h^2} + \mathcal{O}(h^2),$$

and the same for $\frac{\partial^2 T}{\partial y^2}$, so we get a linear system.
First, let us consider the one-dimensional case:

After the discretization of the one-dimensional Laplace equation with Dirichlet boundary conditions we have

$$\frac{u_{i+1} + u_{i-1} - 2u_i}{h^2} = f_i,\quad i=1,\dots,N-1,$$

$$ u_{0} = u_N = 0,$$

or in the matrix form

$$ A u = f,$$

and (for $n = 5$) $$A=-\frac{1}{h^2}\begin{bmatrix} 2 & -1 & 0 & 0 & 0\\ -1 & 2 & -1 & 0 &0 \\ 0 & -1 & 2& -1 & 0 \\ 0 & 0 & -1 & 2 &-1\\ 0 & 0 & 0 & -1 & 2 \end{bmatrix}$$

The matrix is tridiagonal and sparse
(and also Toeplitz: the elements along each diagonal are constant).
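As a sanity check, here is a sketch of building this matrix with scipy.sparse (the size n = 5 matches the example above; diags broadcasts the scalar stencil values along the diagonals):

import scipy.sparse as sp

n = 5
h = 1.0 / (n + 1)
# tridiagonal stencil (-1, 2, -1), scaled by -1/h^2 as in the formula above
A = -sp.diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n)) / h**2
print(A.toarray())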

Block structure in 2D

In two dimensions, we get an equation of the form

$$-\frac{4u_{ij} -u_{(i-1)j} - u_{(i+1)j} - u_{i(j-1)}-u_{i(j+1)}}{h^2} = f_{ij},$$

or in the Kronecker product form

$$\Delta_{2D} = \Delta_{1D} \otimes I + I \otimes \Delta_{1D},$$

where $\Delta_{1D}$ is a 1D Laplacian, and $\otimes$ is a Kronecker product of matrices.

For matrices $A\in\mathbb{R}^{n\times m}$ and $B\in\mathbb{R}^{l\times k}$, their Kronecker product is defined as the block matrix

$$ A\otimes B = \begin{bmatrix}a_{11}B & \dots & a_{1m}B \\ \vdots & \ddots & \vdots \\ a_{n1}B & \dots & a_{nm}B\end{bmatrix}\in\mathbb{R}^{nl\times mk}. $$

In block form the 2D Laplace matrix can be written as follows (here $\Delta_1 = \mathrm{tridiag}(-1, 2, -1)$ is the 1D stencil matrix without the $-\frac{1}{h^2}$ factor):

$$A = -\frac{1}{h^2}\begin{bmatrix} \Delta_1 + 2I & -I & 0 & 0 & 0\\ -I & \Delta_1 + 2I & -I & 0 &0 \\ 0 & -I & \Delta_1 + 2I & -I & 0 \\ 0 & 0 & -I & \Delta_1 + 2I &-I\\ 0 & 0 & 0 & -I & \Delta_1 + 2I \end{bmatrix}$$
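The Kronecker-sum construction is easy to verify numerically; a small sketch with scipy.sparse (names and sizes are illustrative):

import scipy.sparse as sp

n = 5
h = 1.0 / (n + 1)
delta_1d = sp.diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n))
I = sp.eye(n)
# Kronecker sum: Delta_2D = Delta_1D x I + I x Delta_1D
A_2d = -(sp.kron(delta_1d, I) + sp.kron(I, delta_1d)) / h**2
print(A_2d.shape)  # (25, 25), block tridiagonal as above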
Short list of Kronecker product properties

- $(A\otimes B)(C\otimes D) = (AC)\otimes(BD)$ for compatible sizes,
- $(A\otimes B)^\top = A^\top \otimes B^\top$,
- $(A\otimes B)^{-1} = A^{-1}\otimes B^{-1}$ for square invertible $A$ and $B$.

Sparse matrices help in computational graph theory

Florida sparse matrix collection

You can find many more sparse matrices in the Florida sparse matrix collection (now part of the SuiteSparse Matrix Collection), which contains all sorts of matrices from different applications.

Sparse matrices and deep learning

Sparse matrix: construction

Please note the following construction functions; a short sketch with scipy.sparse follows.
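A sketch of the standard scipy.sparse constructors (sizes and density are illustrative):

import scipy.sparse as sp

A = sp.random(100, 100, density=0.01, format='csr')   # random sparse matrix
I = sp.eye(100, format='csr')                          # sparse identity
D = sp.diags([1.0, -2.0, 1.0], offsets=[-1, 0, 1], shape=(100, 100))
L = sp.lil_matrix((100, 100))   # list-of-lists format, convenient to fill
L[0, 0] = 1.0                   # element-wise assignment is cheap in LIL
A_csr = L.tocsr()               # convert to CSR for fast arithmetic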

Sparsity pattern

Sparse matrix: definition

What we need to understand in order to see how it actually works

Sparse matrix storage

There are many storage formats; the most important ones are:

- coordinate format (COO),
- lists of lists (LIL),
- compressed sparse row (CSR),
- compressed sparse column (CSC),
- block variants (e.g. compressed sparse block, CSB).

In scipy there are constructors for each of these formats, e.g. scipy.sparse.lil_matrix(A).

Coordinate format (COO)

The simplest approach is the coordinate format: the sparse matrix is represented by the positions and values of its non-zero elements,

i, j, val

where i and j are integer arrays of row and column indices and val is a real array of the matrix elements.
So we need to store $3\cdot{\bf nnz}$ elements, where nnz denotes the number of nonzeros in the matrix.
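A small sketch of building a matrix in COO format with scipy.sparse (the triplets are illustrative):

import numpy as np
import scipy.sparse as sp

i = np.array([0, 1, 2, 2])                 # row indices
j = np.array([1, 2, 0, 2])                 # column indices
val = np.array([10.0, 20.0, 30.0, 40.0])   # values
A = sp.coo_matrix((val, (i, j)), shape=(3, 3))
print(A.toarray())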

Q: What is good and what is bad in this format?

Main disadvantages

- storage is not optimal: the row index is stored explicitly for every element,
- elements are not grouped by rows, so row access and matrix-by-vector products are inefficient,
- random access to a given element is slow.

The first two disadvantages are resolved by the compressed sparse row (CSR) format.

Compressed sparse row (CSR)

In the CSR format a matrix is stored as 3 different arrays:

ia, ja, sa

where:

- ia ($n+1$ integers) holds the row pointers: the elements of row $i$ occupy positions ia[i] to ia[i+1] - 1 in the other two arrays,
- ja (nnz integers) holds the column indices,
- sa (nnz reals) holds the matrix elements.

So we store $2\cdot{\bf nnz} + n+1$ elements in total.
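In scipy the three CSR arrays are called indptr, indices, and data; a quick look (the matrix is illustrative):

import scipy.sparse as sp

A = sp.coo_matrix(([10.0, 20.0, 30.0, 40.0],
                   ([0, 1, 2, 2], [1, 2, 0, 2])), shape=(3, 3)).tocsr()
print(A.indptr)   # ia: row pointers, length n + 1
print(A.indices)  # ja: column indices, length nnz
print(A.data)     # sa: values, length nnz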

Sparse matrices in PyTorch and TensorFlow

CSR helps in sparse matrix by vector product (SpMV)

# assumes y is initialized with zeros
for i in range(n):
    for k in range(ia[i], ia[i + 1]):
        y[i] += sa[k] * x[ja[k]]

Let us do a short timing test
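A minimal sketch of such a benchmark with scipy.sparse (size, density, and the number of repetitions are illustrative; timings are machine-dependent):

import numpy as np
import scipy.sparse as sp
from time import perf_counter

n = 10000
A_csr = sp.random(n, n, density=1e-3, format='csr')
A_coo = A_csr.tocoo()
x = np.random.rand(n)

for name, A in [('COO', A_coo), ('CSR', A_csr)]:
    t = perf_counter()
    for _ in range(100):
        y = A @ x
    print(name, (perf_counter() - t) / 100, 'sec per matvec')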

As you can see, CSR is faster, and for more unstructured sparsity patterns the gain is even larger.

Sparse matrices and efficiency

Recall how we measure efficiency of linear algebra operations

The standard way to measure the efficiency of linear algebra operations on a particular computing architecture is flops: the number of floating point operations per second.

We can measure the efficiency of an ordinary matrix-by-vector product relative to the peak performance.
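A rough sketch of such a measurement (the size is illustrative; a dense matvec performs about $2n^2$ floating point operations, and a single timing run is noisy):

import numpy as np
from time import perf_counter

n = 4000
A = np.random.rand(n, n)
x = np.random.rand(n)

t = perf_counter()
y = A @ x
elapsed = perf_counter() - t
print('GFLOPS:', 2 * n**2 / elapsed / 1e9)   # compare with the peak of your CPU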

Random data access and cache misses

Q: What if the cache does not have free space?

Cache scheme and LRU (Least recently used)

CSR sparse matrix by vector product

# the same loop as before; note the random access to x[ja[k]]
for i in range(n):
    for k in range(ia[i], ia[i + 1]):
        y[i] += sa[k] * x[ja[k]]

Reordering reduces cache misses

Example
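A sketch using one standard bandwidth-reducing reordering, reverse Cuthill-McKee, via scipy (the matrix and density are illustrative):

import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

A = sp.random(1000, 1000, density=0.01, format='csr')
A = (A + A.T).tocsr()                 # symmetrize the pattern
perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_perm = A[perm, :][:, perm]          # permute rows and columns
# A_perm has a much narrower band, so accesses to x in SpMV become local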

Sparse transpose matrix-by-vector product
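With CSR, the product $y = A^\top x$ can be computed in a single pass over the same three arrays by scattering into $y$ instead of gathering from $x$; a sketch in the same style as above (assumes y is initialized with zeros):

for i in range(n):
    for k in range(ia[i], ia[i + 1]):
        y[ja[k]] += sa[k] * x[i]   # the random access moves to the writes into y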

Compressed sparse block (CSB)

Solving linear systems with sparse matrices

Let us start with a small demo of solving a sparse linear system.
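A minimal sketch of such a demo using the 1D Laplace matrix from above and a sparse direct solver (the size and right-hand side are illustrative):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

n = 100
h = 1.0 / (n + 1)
A = -sp.diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n)) / h**2
A = A.tocsc()                       # spsolve prefers the CSC format
f = np.ones(n)
u = spsolve(A, f)                   # sparse direct solve
print(np.linalg.norm(A @ u - f))    # residual should be near machine precision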

Take home message

Questions?