MPI (Message Passing Interface) is a standard for parallel computing in distributed memory
Skoltech owns supercomputer, which is on the 7th place in the list of TOP50 supercomputers from Commonwealth of Independent States (see the rating here, sorry for only in Russian)
For dense linear algebra problems, we are limited by the memory to store the full matrix, it is $N^2$ parameters.
The class of sparse matrices where most of the elements are zero, allows us at least to store such matrices.
The question if we can:
with sparse matrices
Now we will talk about sparse matrices, where they arise, how we store them, how we operate with them.
Sparse matrices arise in:
The simplest partial differential equation (PDE), called
Laplace equation:
$$ \Delta T = \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} = f(x,y), \quad x,y\in \Omega\equiv[0,1]^2, $$$$ T_{\partial\Omega} = 0. $$same for $\frac{\partial^2 T}{\partial y^2},$
and we get a linear system.
First, let us consider one-dimensional case:
After the discretization of the one-dimensional Laplace equation with Dirichlet boundary conditions we have
$$\frac{u_{i+1} + u_{i-1} - 2u_i}{h^2} = f_i,\quad i=1,\dots,N-1$$$$ u_{0} = u_N = 0$$or in the matrix form
$$ A u = f,$$and (for $n = 5$) $$A=-\frac{1}{h^2}\begin{bmatrix} 2 & -1 & 0 & 0 & 0\\ -1 & 2 & -1 & 0 &0 \\ 0 & -1 & 2& -1 & 0 \\ 0 & 0 & -1 & 2 &-1\\ 0 & 0 & 0 & -1 & 2 \end{bmatrix}$$
The matrix is triadiagonal and sparse
(and also Toeplitz: all elements on the diagonal are the same)
In two dimensions, we get equation of the form
$$-\frac{4u_{ij} -u_{(i-1)j} - u_{(i+1)j} - u_{i(j-1)}-u_{i(j+1)}}{h^2} = f_{ij},$$or in the Kronecker product form
$$\Delta_{2D} = \Delta_{1D} \otimes I + I \otimes \Delta_{1D},$$where $\Delta_{1D}$ is a 1D Laplacian, and $\otimes$ is a Kronecker product of matrices.
For matrices $A\in\mathbb{R}^{n\times m}$ and $B\in\mathbb{R}^{l\times k}$ its Kronecker product is defined as a block matrix of the form
$$ A\otimes B = \begin{bmatrix}a_{11}B & \dots & a_{1m}B \\ \vdots & \ddots & \vdots \\ a_{n1}B & \dots & a_{nm}B\end{bmatrix}\in\mathbb{R}^{nl\times mk}. $$In the block matrix form the 2D-Laplace matrix can be written in the following form:
$$A = -\frac{1}{h^2}\begin{bmatrix} \Delta_1 + 2I & -I & 0 & 0 & 0\\ -I & \Delta_1 + 2I & -I & 0 &0 \\ 0 & -I & \Delta_1 + 2I & -I & 0 \\ 0 & 0 & -I & \Delta_1 + 2I &-I\\ 0 & 0 & 0 & -I & \Delta_1 + 2I \end{bmatrix}$$More sparse matrices you can find in Florida sparse matrix collection which contains all sorts of matrices for different applications.
from IPython.display import IFrame
IFrame("http://yifanhu.net/GALLERY/GRAPHS/search.html", width=700, height=450)
We can create sparse matrix using scipy.sparse package (actually this is not the best sparse matrix package)
We can go to really large sizes (at least, to store this matrix in the memory)
Please note the following functions
spdiags
kron
import numpy as np
import scipy as sp
import scipy.sparse
from scipy.sparse import csc_matrix, csr_matrix
import matplotlib.pyplot as plt
import scipy.linalg
import scipy.sparse.linalg
%matplotlib inline
n = 5
ex = np.ones(n);
lp1 = sp.sparse.spdiags(np.vstack((ex, -2*ex, ex)), [-1, 0, 1], n, n, 'csr');
e = sp.sparse.eye(n)
A = sp.sparse.kron(lp1, e) + sp.sparse.kron(e, lp1)
A = csc_matrix(A)
plt.spy(A, aspect='equal', marker='.', markersize=5)
The spy
command plots the sparsity pattern of the matrix: the $(i, j)$ pixel is drawn, if the corresponding matrix element is non-zero.
Sparsity pattern is really important for the understanding the complexity of the sparse linear algebra algorithms.
Often, only the sparsity pattern is needed for the analysis of "how complex" the matrix is.
The definition of "sparse matrix" is that the number of non-zero elements is much less than the total number of elements.
You can do the basic linear algebra operations (like solving linear systems at the first place) faster, than if working for with the full matrix.
Question 1: How to store the sparse matrix in memory?
Question 2: How to multiply sparse matrix by vector fast?
Question 3: How to solve linear systems with sparse matrices fast?
There are many storage formats, important ones:
In scipy
there are constructors for each of these formats, e.g.
scipy.sparse.lil_matrix(A)
.
The simplest format is to use coordinate format to represent the sparse matrix as positions and values of non-zero elements.
i, j, val
where i, j
are integer array of indices, val
is the real array of matrix elements.
So we need to store $3\cdot$ nnz elements, where nnz denotes number of nonzeroes in the matrix.
Q: What is good and what is bad in this format?
First two disadvantages are solved by compressed sparse row (CSR) format.
In the CSR format a matrix is stored as 3 different arrays:
ia, ja, sa
where:
So, we got $2\cdot{\bf nnz} + n+1$ elements.
for i in range(n):
for k in range(ia[i]:ia[i+1]):
y[i] += sa[k] * x[ja[k]]
Let us do a short timing test
import numpy as np
import scipy as sp
import scipy.sparse
import scipy.sparse.linalg
from scipy.sparse import csc_matrix, csr_matrix, coo_matrix
import matplotlib.pyplot as plt
%matplotlib inline
n = 1000
ex = np.ones(n);
lp1 = sp.sparse.spdiags(np.vstack((ex, -2*ex, ex)), [-1, 0, 1], n, n, 'csr');
e = sp.sparse.eye(n)
A = sp.sparse.kron(lp1, e) + sp.sparse.kron(e, lp1)
A = csr_matrix(A)
rhs = np.ones(n * n)
B = coo_matrix(A)
%timeit A.dot(rhs)
%timeit B.dot(rhs)
As you see, CSR is faster, and for more unstructured patterns the gain will be larger.
The standard way to measure the efficiency of linear algebra operations on a particular computing architecture is to use flops (number of floating point operations per second)
We can measure peak efficiency of an ordinary matrix-by-vector product.
import numpy as np
import time
n = 4000
a = np.random.randn(n, n)
v = np.random.randn(n)
t = time.time()
np.dot(a, v)
t = time.time() - t
print('Time: {0: 3.1e}, Efficiency: {1: 3.1e} Gflops'.\
format(t, ((2 * n ** 2)/t) / 10 ** 9))
n = 400
ex = np.ones(n);
a = sp.sparse.spdiags(np.vstack((ex, -2*ex, ex)), [-1, 0, 1], n, n, 'csr');
rhs = np.random.randn(n)
t = time.time()
a.dot(rhs)
t = time.time() - t
print('Time: {0: 3.1e}, Efficiency: {1: 3.1e} Gflops'.\
format(t, (3 * n) / t / 10 ** 9))
Q: what if cache does not have free space?
for i in range(n):
for k in range(ia[i]:ia[i+1]):
y[i] += sa[k] * x[ja[k]]
ja
stores sequential elements, then they will be moved to cache altogether and number of cache misses decreasesWe can convert given sparse matrix to banded or block diagonal with permutations
Let $P$ be row permutation matrix and $Q$ be column permutation matrix
Let us start with small demo of solving sparse linear system...
n = 10
ex = np.ones(n);
lp1 = sp.sparse.spdiags(np.vstack((ex, -2*ex, ex)), [-1, 0, 1], n, n, 'csr');
e = sp.sparse.eye(n)
A = sp.sparse.kron(lp1, e) + sp.sparse.kron(e, lp1)
A = csr_matrix(A)
rhs = np.ones(n * n)
sol = sp.sparse.linalg.spsolve(A, rhs)
_, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(sol)
ax1.set_title('Not reshaped solution')
ax2.contourf(sol.reshape((n, n), order='f'))
ax2.set_title('Reshaped solution')
Now we start discussion of direct methods for solving sparse linear systems based on LU decomposition...
Why sparse linear systems can be solved faster, what is the technique?
In the LU factorization of the matrix $A$ the factors $L$ and $U$ can be also sparse:
Note that the inverse matrix is not sparse!
n = 7
ex = np.ones(n);
a = sp.sparse.spdiags(np.vstack((ex, -2*ex, ex)), [-1, 0, 1], n, n, 'csr');
b = np.array(np.linalg.inv(a.toarray()))
print(a.toarray())
print(b)
$L$ and $U$ are typically sparse. In the tridiagonal case they are even bidiagonal!
from scipy.sparse.linalg import splu
T = splu(a.tocsc(), permc_spec="NATURAL")
print(T.L.toarray())
Interesting to note that splu
without permc_spec
will produce permutations which will not yield the bidiagonal factor:
from scipy.sparse.linalg import splu
T = splu(a.tocsc())
print(T.L.todense())
print(T.perm_c)
In a 2D case everything is much worse:
n = 20
ex = np.ones(n);
lp1 = sp.sparse.spdiags(np.vstack((ex, -2*ex, ex)), [-1, 0, 1], n, n, 'csr');
e = sp.sparse.eye(n)
A = sp.sparse.kron(lp1, e) + sp.sparse.kron(e, lp1)
A = csc_matrix(A)
T = scipy.sparse.linalg.spilu(A)
plt.spy(T.L, marker='.', color='k', markersize=8)
For correct permutation in 2D case the number of nonzeros in $L$ factor grows as $\mathcal{O}(N \log N)$. But complexity is $\mathcal{O}(N^{3/2})$.
The number of nonzeros in LU decomposition has a deep connection to the graph theory.
networkx package
can be used to visualize graphs, given only the adjacency matrix.
It may even recover to some extend the graph structure.
import networkx as nx
n = 10
ex = np.ones(n);
lp1 = sp.sparse.spdiags(np.vstack((ex, -2*ex, ex)), [-1, 0, 1], n, n, 'csr');
e = sp.sparse.eye(n)
A = sp.sparse.kron(lp1, e) + sp.sparse.kron(e, lp1)
A = csc_matrix(A)
G = nx.Graph(A)
nx.draw(G, pos=nx.spectral_layout(G), node_size=10)
The fill-in of a matrix are those entries which change from an initial zero to a nonzero value during the execution of an algorithm.
The fill-in is different for different permutations. So, before factorization we need to find reordering which produces the smallest fill-in.
Example
$$A = \begin{bmatrix} * & * & * & * & *\\ * & * & 0 & 0 & 0 \\ * & 0 & * & 0 & 0 \\ * & 0 & 0& * & 0 \\ * & 0 & 0& 0 & * \end{bmatrix} $$If we eliminate elements from the top to the bottom, then we will obtain dense matrix. However, we could maintain sparsity if elimination was done from the bottom to the top.
Reordering the rows and the columns of the sparse matrix in order to reduce the number of nonzeros in $L$ and $U$ factors is called fill-in minimization.
Examples of algorithms for fill-in minimization: