The complexity of computing an LU decomposition of a dense $n\times n$ matrix is $\mathcal{O}(n^3)$. A significant reduction in complexity can be achieved if the matrix has a special structure, e.g. it is sparse. In the following task we consider an important example of LU decomposition for a special type of sparse matrices: band matrices with bandwidth $m$ equal to 3 or 5, which are called tridiagonal and pentadiagonal, respectively.
Implement a function band_lu(diag_broadcast, n) which computes the LU decomposition of a tridiagonal or pentadiagonal matrix with the given diagonal values. For example, the input parameters (diag_broadcast = [4,-2,1], n = 4) mean that we need to find the LU decomposition of the corresponding $4 \times 4$ tridiagonal matrix. As output you should return L and U - 2D arrays representing the diagonals of the factors $L$ (L[0] keeps the first lower diagonal, L[1] keeps the second lower diagonal, ...) and $U$ (U[:,0] keeps the main diagonal, U[:,1] keeps the first upper diagonal, ...).
Compare the execution time of a standard LU factorization from scipy, i.e. one which takes the whole matrix and does not know about its special structure, with the band decomposition of your implementation. Comment on the results.
from scipy.sparse import diags # can be used with broadcasting of scalars if desired dimensions are large
import numpy as np
# INPUT : diag_broadcast - list of diagonal values to broadcast, length equal to 3 or 5; n - integer, band matrix shape.
# OUTPUT : L - 2D np.ndarray, L.shape[0] depends on bandwidth, L.shape[1] = n-1; do not store the main diagonal, which is all ones; pad rows with zeros on the right to handle the changing length of diagonals.
#          U - 2D np.ndarray, U.shape[0] = n, U.shape[1] depends on bandwidth;
#          pad columns with zeros at the bottom to handle the changing length of diagonals.
def band_lu(diag_broadcast, n):
# enter your code here
raise NotImplementedError()
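For reference, here is a minimal sketch of the tridiagonal case only (bandwidth 3). It assumes the ordering diag_broadcast = [lower, main, upper] and returns plain diagonal arrays rather than the padded layout described above, so treat it as an illustration, not the required implementation.
import numpy as np

# Sketch: LU of a Toeplitz tridiagonal matrix with constant diagonals, no pivoting.
# Assumes diag_broadcast = [lower, main, upper]; the pentadiagonal case is not covered.
def band_lu_tridiag_sketch(diag_broadcast, n):
    lower, main, upper = (float(d) for d in diag_broadcast)
    u_main = np.zeros(n)               # main diagonal of U
    u_upper = np.full(n - 1, upper)    # first upper diagonal of U
    l_lower = np.zeros(n - 1)          # first lower diagonal of L (unit diagonal not stored)
    u_main[0] = main
    for i in range(1, n):
        l_lower[i - 1] = lower / u_main[i - 1]      # elimination multiplier
        u_main[i] = main - l_lower[i - 1] * upper   # pivot update
    return l_lower, u_main, u_upper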
# Your solution is here
Let $A = \begin{pmatrix} \varepsilon & 1 & 0\\ 1 & 1 & 1 \\ 0 & 1 & 1 \end{pmatrix}.$
# Your solution is here
Let $A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$ be a block matrix. The goal is to solve the linear system
$$ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} f_1 \\ f_2 \end{bmatrix}, $$
where $X$ is a nonsingular square matrix.
# Your solution is here
In the lecture we provided a naive implementation of LU factorization with loops and an elementwise update of the factors. In this subproblem we ask you to provide a more efficient implementation of LU factorization and explain how you derived it (the main ideas and how you use them in this particular case).
NumPy and JAX are both acceptable in this subproblem, but please use a single library for all implementations.
# Your solution is here
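One standard way to speed up the naive version is to replace the two innermost elementwise loops with a single vectorized rank-1 update of the trailing submatrix at every elimination step. A minimal NumPy sketch (no pivoting, illustration only, not necessarily the intended solution):
import numpy as np

# Sketch: LU without pivoting; the inner loops are replaced by a vectorized
# rank-1 (Schur complement) update of the trailing submatrix.
def lu_vectorized_sketch(A):
    n = A.shape[0]
    L = np.eye(n)
    U = np.array(A, dtype=float)
    for k in range(n - 1):
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]              # elimination multipliers
        U[k + 1:, k:] -= np.outer(L[k + 1:, k], U[k, k:])  # rank-1 update
    return L, U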
(5 pts) Prove that a normal matrix is Hermitian iff its eigenvalues are real. Prove that a normal matrix is unitary iff its eigenvalues satisfy $|\lambda| = 1$.
(5 pts) The following problem illustrates the instability of the Jordan form. Find theoretically the eigenvalues of the perturbed Jordan block
$$ J(\varepsilon) = \begin{bmatrix} \lambda & 1 & & & \\ & \lambda & 1 & & \\ & & \ddots & \ddots & \\ & & & \lambda & 1 \\ \varepsilon & & & & \lambda \end{bmatrix} \in \mathbb{C}^{n \times n}. $$
Comment on how the eigenvalues of $J(0)$ are perturbed for large $n$.
# Your solution is here
# INPUT: G - np.ndarray or sparse matrix
# OUTPUT: A - np.ndarray (of size G.shape) or sparse matrix
def pagerank_matrix(G):
# enter your code here
return A
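A minimal sketch of one common construction is given below. It assumes the column-stochastic convention $A = G^\top D^{-1}$, where $D$ is the diagonal matrix of out-degrees, handles only the dense case, and leaves columns of dangling nodes (zero out-degree) as zeros; the convention intended in this task may differ.
import numpy as np

# Sketch (dense case): column-stochastic PageRank matrix A = G^T D^{-1}.
# Columns corresponding to dangling nodes (zero out-degree) stay zero.
def pagerank_matrix_sketch(G):
    G = np.asarray(G, dtype=float)
    out_deg = G.sum(axis=1)  # out-degree of every node (row sums)
    inv_deg = np.divide(1.0, out_deg, out=np.zeros_like(out_deg), where=out_deg > 0)
    A = (G * inv_deg[:, None]).T  # normalize each row of G, then transpose
    return A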
Implement the power method for a given matrix $A$, an initial guess x0 and a number of iterations num_iter. It should be organized as a function power_method(A, x0, num_iter) that outputs an approximation to the eigenvector $x$, the eigenvalue $\lambda$ and the history of residuals $\{\|Ax_k - \lambda_k x_k\|_2\}$. Make sure that the method converges to the correct solution on the matrix $\begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}$, which is known to have the largest eigenvalue equal to $3$.
# INPUT: A - np.ndarray (2D), x0 - np.ndarray (1D), num_iter - integer (positive)
# OUTPUT: x - np.ndarray (of size x0), l - float, res - np.ndarray (of size num_iter + 1 [include initial guess])
def power_method(A, x0, num_iter): # 5 pts
# enter your code here
return x, l, res
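A minimal sketch (using the Rayleigh quotient as the eigenvalue estimate and recording the residual at every step, including the initial guess) might look as follows:
import numpy as np

# Sketch of the power method with Rayleigh-quotient eigenvalue estimates.
def power_method_sketch(A, x0, num_iter):
    x = x0 / np.linalg.norm(x0)
    res = np.zeros(num_iter + 1)
    l = x @ (A @ x)                         # Rayleigh quotient
    res[0] = np.linalg.norm(A @ x - l * x)  # residual of the initial guess
    for k in range(1, num_iter + 1):
        x = A @ x
        x = x / np.linalg.norm(x)
        l = x @ (A @ x)
        res[k] = np.linalg.norm(A @ x - l * x)
    return x, l, res

# Quick check on the matrix [[2, -1], [-1, 2]], whose largest eigenvalue is 3:
# x, l, res = power_method_sketch(np.array([[2., -1.], [-1., 2.]]), np.array([1., 0.]), 50)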
Run the power method with num_iter=100 and a random initial guess x0. Explain the absence of convergence.
Run the power method with num_iter=100 for 10 different initial guesses and print/plot the resulting approximated eigenvectors. Why do they depend on the initial guess?
In order to avoid this problem, Larry Page and Sergey Brin proposed to use the following regularization technique:
$$ A_d = dA + \frac{1-d}{N} \begin{pmatrix} 1 & \dots & 1 \\ \vdots & & \vdots \\ 1 & \dots & 1 \end{pmatrix}, $$
where $d$ is a small parameter in $[0,1]$ (typically $d=0.85$) called the damping factor, and $A$ is of size $N\times N$. Now $A_d$ is a matrix for which the multiplicity of the largest eigenvalue is equal to 1. Recall that computing the eigenvector of the PageRank matrix which corresponds to the largest eigenvalue has the following interpretation. Consider a person who stays in a random node of a graph (i.e. opens a random web page); at each step s/he follows one of the outgoing edges uniformly at random (i.e. opens one of the links). So the person randomly walks through the graph, and the eigenvector we are looking for is exactly his/her stationary distribution: for each node it tells you the probability of visiting this particular node. Therefore, if the person has started from a part of the graph which is not connected to the other part, s/he will never get there. In the regularized model, the person at each step follows one of the outgoing links with probability $d$ OR teleports to a random node of the whole graph with probability $(1-d)$.
(2 pts) Now, run the power method with $A_d$ and plot the residuals $\|A_d x_k - \lambda_k x_k\|_2$ as a function of $k$ for $d=0.97$, num_iter=100 and a random initial guess x0.
(5 pts) Find the second largest (in absolute value) eigenvalue of the obtained matrix $A_d$. How and why is it connected to the damping factor $d$? What is the convergence rate of the PageRank algorithm when the damping factor is used?
Usually, graphs that arise in various areas are sparse (social, web, road networks, etc.) and, thus, computation of a matrix-vector product for corresponding PageRank matrix $A$ is much cheaper than $\mathcal{O}(N^2)$. However, if $A_d$ is calculated directly, it becomes dense and, therefore, $\mathcal{O}(N^2)$ cost grows prohibitively large for big $N$.
(2 pts) Implement a fast matrix-vector product for $A_d$ as a function pagerank_matvec(A, d, x), which takes a PageRank matrix $A$ (in sparse format, e.g., csr_matrix), a damping factor $d$ and a vector $x$ as input and returns $A_dx$ as output.
(1 pt) Generate a random adjacency matrix of size $10000 \times 10000$ with only 100 non-zero elements and compare the performance of pagerank_matvec with the direct evaluation of $A_dx$.
# INPUT: A - np.ndarray (2D), d - float (from 0.0 to 1.0), x - np.ndarray (1D, size of A.shape[0/1])
# OUTPUT: y - np.ndarray (1D, size of x)
def pagerank_matvec(A, d, x): # 2 pts
# enter your code here
return y
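Since the regularization term is a rank-1 matrix of all ones scaled by $(1-d)/N$, its action on a vector is just a scaled sum of the vector's entries. A minimal sketch along these lines (assuming A supports the @ product, as csr_matrix does):
import numpy as np

# Sketch: A_d x = d * (A @ x) + (1 - d) / N * sum(x) * ones(N);
# the dense rank-1 term is never formed explicitly.
def pagerank_matvec_sketch(A, d, x):
    N = A.shape[0]
    return d * (A @ x) + (1.0 - d) / N * x.sum() * np.ones(N)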
Download the dataset from here, unzip it and put dblp_authors.npz and dblp_graph.npz in the same folder with this notebook. Each value (author name) from dblp_authors.npz corresponds to the row/column of the matrix from dblp_graph.npz. The value at row i and column j of the matrix from dblp_graph.npz corresponds to the number of times author i cited papers of author j. Let us now find the most significant scientists according to the PageRank model over the DBLP data.
Load the citation graph and the list of authors with the provided load_dblp(...) function. Print its density (fraction of nonzero elements). Find the top-10 most cited authors from the weighted adjacency matrix. Now, make all the weights of the adjacency matrix equal to 1 for simplicity (consider only the existence of a connection between authors, not its weight). Obtain the PageRank matrix $A$ from the adjacency matrix and verify that it is stochastic.
In order to pass pagerank_matvec to your power_method (without rewriting it) for fast calculation of $A_dx$, you can create a LinearOperator:
L = scipy.sparse.linalg.LinearOperator(A.shape, matvec=lambda x, A=A, d=d: pagerank_matvec(A, d, x))
Then L@x or L.dot(x) will result in a call to pagerank_matvec(A, d, x), and thus you can plug $L$ instead of the matrix $A$ into power_method directly. Note: though in the previous subtask the graph was very small (so you could get away without the fast matvec implementation), here it is very large (but sparse), so direct evaluation of $A_dx$ would require storing $\sim 10^{12}$ matrix elements - good luck with that (^_<).
from scipy.sparse import load_npz
import numpy as np
def load_dblp(path_auth, path_graph):
G = load_npz(path_graph).astype(float)
with np.load(path_auth) as data: authors = data['authors']
return G, authors
G, authors = load_dblp('dblp_authors.npz', 'dblp_graph.npz')
# Your code is here
# INPUT:
# A_init - square matrix,
# num_iter - number of iterations for QR algorithm
# OUTPUT:
# Ak - transformed matrix A_init given by QR algorithm,
# convergence - numpy array of shape (num_iter, ),
#               where we store the Chebyshev norm (maximal absolute value)
#               of the part of Ak below the main diagonal at every iteration
def qr_algorithm(A_init, num_iter): # 3 pts
# enter your code here
return Ak, convergence
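A minimal sketch (unshifted QR iteration via np.linalg.qr, with the largest absolute value below the main diagonal recorded as the convergence measure):
import numpy as np

# Sketch of the unshifted QR algorithm: A_{k+1} = R_k Q_k, where A_k = Q_k R_k.
def qr_algorithm_sketch(A_init, num_iter):
    Ak = np.array(A_init, dtype=float)
    convergence = np.zeros(num_iter)
    for it in range(num_iter):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q
        convergence[it] = np.abs(np.tril(Ak, k=-1)).max()  # below-diagonal magnitude
    return Ak, convergence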
Visualize the result with plt.spy(Ak, precision=1e-7).
# Your solution is here
Visualize the result with plt.spy(Ak, precision=1e-7). Is this matrix lower triangular? How does this correspond to the claim about convergence of the QR algorithm?
# Your solution is here
Let us redefine the scalar product for all $x, y \in \mathbb{C}^n$ in the following way:
$$ [x,y]_J = y^{*}Jx, \quad \text{s.t.}\ J = \text{diag}(j_{11}, j_{22}, \dots, j_{nn})\ \text{and}\ j_{ii} = \pm1\ \forall i \in [1,n].$$
Denote the rows of a matrix $V \in \mathbb{C}^{n \times n}$ as $v_1, v_2, \dots, v_n$. Then $V$ is called $\textbf{J-orthonormal}$ iff
$$[v_i, v_k]_J = \pm \delta_{ik}.$$
We will call a matrix $T \in \mathbb{C}^{n \times n}$ $\textbf{almost triangular}$ iff $T$ is upper triangular with diagonal blocks of order $1$ or $2$.
A matrix $A \in \mathbb{C}^{n \times n}$ is said to be $\textbf{J-decomposable}$ if there exist a J-orthonormal matrix $V$ and an upper triangular matrix $T$ such that
$$A = V T V^{-1}.$$
A matrix $A \in \mathbb{C}^{n \times n}$ is said to have a $\textbf{pseudoschur J-decomposition}$ if there exist a J-orthonormal matrix $V$ and an almost triangular matrix $T$ such that
$$A = V T V^{-1}.$$
This problem is meant to get you familiar with the fact that the two abovementioned decompositions do not exist for every square matrix with complex entries.
(8 pts) Given that $A \in \mathbb{C}^{n \times n}$ is diagonalizable, show that it has a pseudoschur J-decomposition for any $J$ of the form $J=\text{diag}(\pm 1, \dots, \pm 1)$. Note that in order to solve this subproblem you should first prove the following fact:
Let $S \in \mathbb{C}^{m \times n},\ m \ge n,\ J = \text{diag}(\pm 1).$ If $A = S^{*}JS$ and $\det(A) \ne 0$, then there exists a QR decomposition of $S$ with respect to $J$: $$S = P_1 QR P_2^{*} = P_1 Q \begin{bmatrix} R_1 \\ 0 \end{bmatrix} P_2^{*}, \quad Q^{*} J^{'}Q = J^{'}, \quad J^{'} = P_1^{*}JP_1,$$ where $P_1$ and $P_2$ are permutation matrices, $Q$ is called $J^{'}$-unitary and $R_1$ is almost triangular.
# Your solution is here
The application that we are particularly interested in is the approximation of a given matrix by a low-rank matrix:
$$ A \approx UV^T, \quad A \in \mathbb{R}^{m \times n}, \ U \in \mathbb{R}^{m \times r}, \ V \in \mathbb{R}^{n \times r}.$$
It is well known that the best (in any unitarily invariant norm) low-rank approximation can be computed via the singular value decomposition (SVD). As an alternative, we can consider a skeleton decomposition of the form:
$$A \approx A_r = A(: , \mathcal{J})A(\mathcal{I} , \mathcal{J})^{-1}A(\mathcal{I} , :),$$
where $\mathcal{I,J}$ are some index sets of length $r$.
Below we consider the $N \times N$ matrix obtained by discretizing the following function on a uniform grid in the square $[0, 1] \times [0, 1]$: $f(x, y) = \exp(-\sqrt{x^2 + y^2})$. It means that $A = [a_{ij}]$, where $a_{ij} = f(x_i, x_j)$, $x_i = i / (N-1)$, $x_j = j / (N-1)$, and $i, j = 0,\ldots, N-1$.
Construct this matrix (np.meshgrid can help you).
# Your solution is here
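A minimal construction sketch, using an assumed grid size N for illustration:
import numpy as np

# Sketch: a_ij = exp(-sqrt(x_i^2 + x_j^2)) on the uniform grid x_i = i/(N-1).
N = 1000  # assumed grid size, replace with the value required by the task
x = np.linspace(0.0, 1.0, N)
X, Y = np.meshgrid(x, x, indexing='ij')  # X[i, j] = x_i, Y[i, j] = x_j
A = np.exp(-np.sqrt(X**2 + Y**2))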
(3 pts) Compute the skeleton approximation with a random selection of row and column indices for $r = 5$ (check that the submatrix at the intersection of the rows and columns is nonsingular). Average the relative error $\frac{\|A - A_r \|_F}{\|A\|_F}$ over $M$ samples of column/row indices. Check that $M$ is sufficiently large to provide a stable mean.
Note: extracting submatrices should be done with numpy indexing, e.g. A[index_set, :] to extract the selected rows and A[:, index_set] to extract the selected columns.
# Your solution is here
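A minimal sketch of this experiment (assuming the matrix A built above; M and the seed are illustrative choices):
import numpy as np

# Sketch: average the relative Frobenius error of A_r = A[:, J] inv(A[I, J]) A[I, :]
# over M random draws of the index sets I and J, skipping singular intersections.
def skeleton_error_sketch(A, r=5, M=100, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    norm_A = np.linalg.norm(A, 'fro')
    errors = []
    while len(errors) < M:
        I = rng.choice(n, size=r, replace=False)
        J = rng.choice(n, size=r, replace=False)
        A_hat = A[np.ix_(I, J)]
        if np.linalg.matrix_rank(A_hat) < r:      # nonsingularity check
            continue
        A_r = A[:, J] @ np.linalg.solve(A_hat, A[I, :])
        errors.append(np.linalg.norm(A - A_r, 'fro') / norm_A)
    return np.mean(errors)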
As you should know from the lecture, if $A$ is of rank $r$ and $\hat{A} = A(\mathcal{I} , \mathcal{J})$ is nonsingular, then the exact equality holds. In the approximate case, however, the quality of the approximation depends on the volume of the submatrix $\hat{A}$:
Theorem
If $\hat{A} = A_{max}$ has the maximal in modulus determinant among all $r \times r$ submatrices of $A$, the following error estimate holds:
$$ \|A - A_r\|_1 \leq (r+1)\sigma_{r+1}.$$
And the question is how to choose a good submatrix of nearly maximal volume in practice.
Definition: We call an $r \times r$ submatrix $A_{dom}$ of a rectangular $n \times r$ matrix $A$ of full rank dominant, if all the entries of $AA_{dom}^{-1}$ are not greater than $1$ in modulus.
The crucial theoretical result behind the scenes is that the volume of any dominant submatrix $A_{dom}$ cannot be much smaller than that of the maximum-volume submatrix $A_{max}$ (stated without proof).
We provide the following algorithm for constructing a dominant submatrix of a tall matrix.
Algorithm 1:
Given a matrix $A$ of size $n \times r$, find a dominant submatrix of size $r \times r$.
step 0. Start with an arbitrary nonsingular $r \times r$ submatrix $A_{dom}$. Reorder the rows of $A$ so that $A_{dom}$ occupies the first $r$ rows of $A$.
step 1. Compute $B = AA_{dom}^{-1}$ and find its maximum in modulus entry $b_{ij}$.
step 2. If $|b_{ij}| > 1 + \delta$, then:
Swap rows $i$ and $j$ in $B$ (and accordingly in $A$). By swapping the rows we have increased the volume of the upper submatrix in $B$, as well as in $A$ (why?). Let $A_{dom}$ be the new upper submatrix of $A$ and go to step 1.
otherwise ($|b_{ij}| \leq 1 + \delta$):
return $A_{dom}$.
Note: $\delta = 10^{-2}$ seems to be a good practical choice.
Implement Algorithm 1 as a function that returns a row_indices array, which can be used as A[row_indices, :] to extract the selected submatrix. Note that the matrix inverse $A_{dom}^{-1}$ has to be updated efficiently at every iteration using the Sherman-Morrison formula (inverse of a rank-1 update).
def dominant_submatrix_search(A):
# Your code is here
return row_indices
# check the convergence of your implementation on random data
row_indices = dominant_submatrix_search(np.random.rand(5000, 10))
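For reference, here is a plain sketch of Algorithm 1. It recomputes $A_{dom}^{-1}$ from scratch at every step instead of applying the Sherman-Morrison update, and it starts from the first $r$ rows (assumed nonsingular, which is fine for random data), so it only illustrates the swapping logic, not the efficient implementation asked for above.
import numpy as np

# Sketch of Algorithm 1 without the Sherman-Morrison update: the inverse of the
# current top r x r block is recomputed from scratch at every iteration.
def dominant_submatrix_search_sketch(A, delta=1e-2, max_iter=10000):
    n, r = A.shape
    row_indices = np.arange(n)   # current ordering; the first r rows form A_dom
    for _ in range(max_iter):
        A_perm = A[row_indices, :]
        B = A_perm @ np.linalg.inv(A_perm[:r, :])
        i, j = np.unravel_index(np.abs(B).argmax(), B.shape)
        if np.abs(B[i, j]) <= 1 + delta:
            break                # A_dom is dominant, stop
        # move row i into the top block by swapping it with row j
        row_indices[[i, j]] = row_indices[[j, i]]
    return row_indices[:r]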
Hint:
1) start with a random selection of columns;
2) if you transpose the matrix for which you find rows, then Algorithm 1 will give you an update for the columns of the initial matrix.
def skeleton(A, r):
# Your code is here
return row_indices, column_indices
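A minimal sketch following the hint, assuming your dominant_submatrix_search from the previous subtask is implemented (the number of alternations and the random seed are illustrative choices):
import numpy as np

# Sketch: start from random columns, then alternate between updating the rows
# (dominant rows of A[:, column_indices]) and the columns (dominant rows of
# A[row_indices, :].T, i.e. dominant columns of A[row_indices, :]).
def skeleton_sketch(A, r, n_alternations=5, seed=0):
    rng = np.random.default_rng(seed)
    column_indices = rng.choice(A.shape[1], size=r, replace=False)
    row_indices = None
    for _ in range(n_alternations):
        row_indices = dominant_submatrix_search(A[:, column_indices])
        column_indices = dominant_submatrix_search(A[row_indices, :].T)
    return row_indices, column_indices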
Compare the skeleton approximation algorithm with the SVD and discuss its pros and cons.
# Your solution is here