Problem set 2 (45 + 50 + 33 + 15 = 143 pts)

Problem 1 (LU decomposition) 45 pts

1. LU for band matrices and Cholesky decomposition (13 pts)

The complexity of computing an LU decomposition of a dense $n\times n$ matrix is $\mathcal{O}(n^3)$. A significant reduction in complexity can be achieved if the matrix has a certain structure, e.g. it is sparse. In the following task we consider an important example of $LU$ for a special type of matrices: band matrices with top-left entry equal to $1$ and bandwidth $m$ equal to 3 or 5, which are called tridiagonal and pentadiagonal respectively. The bands may be $[1, 2, 1]$ and $[1, 1, 2, 1, 1]$ respectively, e.g. for the tridiagonal case

$$A = \begin{pmatrix} 1 & 1 & 0 & 0\\ 1 & 2 & 1 & 0 \\ 0 & 1 & 2 & 1 \\ 0 & 0 & 1 & 2 \\ \end{pmatrix}.$$

Provide extensive testing of the implemented function and make sure it works correctly for large $n$, e.g. $n=100$. As output, produce L and U, 2D arrays representing the diagonals of the factors $L$ (L[0] keeps the first lower diagonal, L[1] keeps the second lower, ...) and $U$ (U[:,0] keeps the main diagonal, U[:,1] keeps the first upper, ...).
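For the tridiagonal case ($m = 3$), a minimal sketch in the required diagonal format might look as follows; the function name and the check on the $4\times 4$ example above are ours, and no pivoting is performed:

```python
import numpy as np

def band_lu_tridiag(a, b, c):
    """LU of a tridiagonal matrix without pivoting (a sketch for m = 3).

    a : main diagonal (n,), b : first lower diagonal (n-1,),
    c : first upper diagonal (n-1,).
    Returns L (1 x (n-1): lower diagonals of L) and
    U (n x 2: U[:, 0] main diagonal, U[:, 1] first upper, zero-padded).
    """
    n = len(a)
    L = np.zeros((1, n - 1))
    U = np.zeros((n, 2))
    U[0, 0] = a[0]
    U[:-1, 1] = c           # the upper diagonal of U coincides with c
    for i in range(1, n):
        L[0, i - 1] = b[i - 1] / U[i - 1, 0]     # elimination multiplier
        U[i, 0] = a[i] - L[0, i - 1] * c[i - 1]
    return L, U

# quick check on the 4 x 4 example above
n = 4
a = np.array([1., 2., 2., 2.]); b = np.ones(n - 1); c = np.ones(n - 1)
L, U = band_lu_tridiag(a, b, c)
A = np.diag(a) + np.diag(b, -1) + np.diag(c, 1)
Lf = np.eye(n) + np.diag(L[0], -1)
Uf = np.diag(U[:, 0]) + np.diag(U[:-1, 1], 1)
assert np.allclose(Lf @ Uf, A)
```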

2. Stability of LU (8 pts)

$ A = \begin{pmatrix} 0 & 1 \\ 2 & 3 \end{pmatrix}.$

$B = \begin{pmatrix} 1 & 1 & 0\\ 1 & 1 & 2 \\ 1 & 2 & 1 \end{pmatrix}.$

$C = \begin{pmatrix} 1 & c & 0\\ 2 & 4 & 1 \\ 3 & 5 & 1 \end{pmatrix}.$

3. Implementation of PLU decomposition (14 pts)

As you may have noticed above, LU decomposition can fail. In order to make it stable, we can use LU decomposition with pivoting (PLU).

We want to find a permutation matrix $P$ such that the LU decomposition of $PA$ exists:

$$ PA = LU. $$

Apply your implementation to

$$A = \begin{pmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 2 & 1 \\ 1 & 2 & 1 & 0\\ 1 & 2 & 0 & 0 \\ \end{pmatrix}.$$

Either NumPy or JAX is fine in this problem, but please use a single library for all implementations.
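A minimal NumPy sketch of PLU with partial pivoting (the function name is ours):

```python
import numpy as np

def plu(A):
    """PLU decomposition with partial pivoting (a sketch): returns P, L, U
    such that P @ A = L @ U."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    P, L, U = np.eye(n), np.eye(n), A.copy()
    for k in range(n - 1):
        # bring the entry of largest modulus in column k to the diagonal
        p = k + np.argmax(np.abs(U[k:, k]))
        U[[k, p]] = U[[p, k]]
        P[[k, p]] = P[[p, k]]
        L[[k, p], :k] = L[[p, k], :k]    # swap the multipliers computed so far
        # eliminate the entries below the pivot
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]
        U[k + 1:] -= np.outer(L[k + 1:, k], U[k])
    return P, L, U

A = np.array([[0., 0., 1., 1.],
              [0., 1., 2., 1.],
              [1., 2., 1., 0.],
              [1., 2., 0., 0.]])
P, L, U = plu(A)
assert np.allclose(P @ A, L @ U)
```

Note that the matrix above has a zero $(1,1)$ entry, so plain LU fails at the very first step, while PLU succeeds.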

4. Block LU (10 pts)

Let $A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$ be a block matrix. The goal is to solve the linear system

$$ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} f_1 \\ f_2 \end{bmatrix}. $$

Prove also the following determinant identities:

$$\det(X+AB) = \det(X)\det(I+BX^{-1}A), $$

where $X$ is a nonsingular square matrix, and

$$\det(I_m - FG) = \det(I_n - GF).$$
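The block system can be solved by block elimination: eliminating $u_1$ from the first block row leaves the Schur complement system $S u_2 = f_2 - A_{21}A_{11}^{-1}f_1$ with $S = A_{22} - A_{21}A_{11}^{-1}A_{12}$. A minimal sketch (the function name is ours; it assumes $A_{11}$ and $S$ are nonsingular):

```python
import numpy as np

def block_solve(A11, A12, A21, A22, f1, f2):
    """Solve the 2x2 block system by block elimination (a sketch; assumes
    A11 and the Schur complement S are nonsingular)."""
    X = np.linalg.solve(A11, np.column_stack([A12, f1]))  # A11^{-1} [A12 | f1]
    Y, g = X[:, :-1], X[:, -1]
    S = A22 - A21 @ Y                      # Schur complement of A11
    u2 = np.linalg.solve(S, f2 - A21 @ g)
    u1 = g - Y @ u2                        # back substitution in the first row
    return u1, u2

# random consistency check
rng = np.random.default_rng(0)
A11, A12 = rng.normal(size=(3, 3)), rng.normal(size=(3, 2))
A21, A22 = rng.normal(size=(2, 3)), rng.normal(size=(2, 2))
f1, f2 = rng.normal(size=3), rng.normal(size=2)
u1, u2 = block_solve(A11, A12, A21, A22, f1, f2)
A = np.block([[A11, A12], [A21, A22]])
assert np.allclose(A @ np.concatenate([u1, u2]), np.concatenate([f1, f2]))
```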

Problem 2 (eigenvalues) (50 pts)

1. Theoretical tasks (15 pts)

$$ J(\varepsilon) = \begin{bmatrix} \lambda & 1 & & & 0 \\ 0 & \lambda & 1 & & \\ & 0 & \ddots & \ddots & \\ & & 0 & \lambda & 1 \\ \varepsilon & & & 0 & \lambda \\ \end{bmatrix}_{n\times n} $$

Comment on how the eigenvalues of $J(0)$ are perturbed by the $\varepsilon$ entry for large $n$.
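A numerical experiment one could start from (variable names are ours):

```python
import numpy as np

def J(lam, eps, n):
    """Build the perturbed Jordan block J(eps) defined above."""
    A = lam * np.eye(n) + np.diag(np.ones(n - 1), 1)
    A[-1, 0] = eps
    return A

# even a perturbation at machine-precision level moves the eigenvalues visibly
lam, eps, n = 1.0, 1e-16, 50
mu = np.linalg.eigvals(J(lam, eps, n))
print(np.max(np.abs(mu - lam)))   # compare this with eps and with eps**(1/n)
```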

2. PageRank (35 pts)

Damping factor importance

In order to avoid this problem, Larry Page and Sergey Brin proposed the following regularization technique:

$$ A_d = dA + \frac{1-d}{N} \begin{pmatrix} 1 & \dots & 1 \\ \vdots & & \vdots \\ 1 & \dots & 1 \end{pmatrix}, $$

where $d \in [0,1]$ is called the damping factor (typically $d=0.85$) and $A$ is of size $N\times N$. Now the largest eigenvalue of $A_d$ has multiplicity 1. Recall that computing the eigenvector of the PageRank matrix corresponding to the largest eigenvalue has the following interpretation. Consider a person who starts at a random node of a graph (i.e. opens a random web page); at each step s/he follows one of the outgoing edges uniformly at random (i.e. opens one of the links). So the person randomly walks through the graph, and the eigenvector we are looking for is exactly his/her stationary distribution: for each node it gives the probability of visiting that particular node. Therefore, if the person starts in a part of the graph that is not connected to the rest, s/he will never reach the rest. In the regularized model, the person at each step follows one of the outgoing links with probability $d$ OR teleports to a random node of the whole graph with probability $1-d$.

Usually, graphs that arise in various areas (social, web, road networks, etc.) are sparse, so a matrix-vector product with the corresponding PageRank matrix $A$ costs much less than $\mathcal{O}(N^2)$. However, if $A_d$ is formed explicitly, it is dense, and the $\mathcal{O}(N^2)$ cost becomes prohibitive for large $N$.
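The dense term never has to be formed: the all-ones matrix times $x$ is just $(\sum_i x_i)$ times the all-ones vector. A minimal sketch of such a matrix-free product (the function name is ours):

```python
import numpy as np

def pagerank_matvec(A, d, x):
    """Multiply A_d by x without ever forming the dense matrix A_d (a sketch;
    A is the sparse N x N PageRank matrix, d is the damping factor)."""
    # the all-ones matrix times x equals (sum of x) times the all-ones vector
    return d * (A @ x) + (1 - d) / x.shape[0] * x.sum() * np.ones_like(x)
```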

DBLP: computer science bibliography

Download the dataset from here, unzip it, and put dblp_authors.npz and dblp_graph.npz in the same folder as this notebook. Each value (author name) from dblp_authors.npz corresponds to a row/column of the matrix from dblp_graph.npz. The value at row i and column j of the matrix from dblp_graph.npz is the number of times author i cited papers of author j. Let us now find the most significant scientists according to the PageRank model over the DBLP data.
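Loading the two files might look like this; the array key inside dblp_authors.npz is not specified above, so we inspect it rather than assume it:

```python
import numpy as np
from scipy.sparse import load_npz

graph = load_npz('dblp_graph.npz')   # sparse matrix of citation counts
# the array key inside the .npz is unknown; inspect data.files to find it
data = np.load('dblp_authors.npz', allow_pickle=True)
print(data.files)
authors = data[data.files[0]]
print(graph.shape, authors[:3])
```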

Problem 3. QR algorithm (33 pts)

Symmetric case (3 pts)

Nonsymmetric case (5 pts)

QR algorithms with Rayleigh Quotient shift (10 pts)

In the lectures, the Rayleigh quotient shift was introduced to speed up the convergence of the power method. Here we ask you to generalize this approach to construct shifts in the QR algorithm.
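A minimal sketch of the shifted iteration (the function name is ours; deflation and convergence checks are omitted for brevity):

```python
import numpy as np

def qr_rq_shift(A, num_iter=100):
    """QR algorithm with the Rayleigh quotient shift mu_k = A_k[-1, -1]
    (a sketch; deflation and convergence checks are omitted)."""
    Ak = np.array(A, dtype=float)
    n = Ak.shape[0]
    for _ in range(num_iter):
        mu = Ak[-1, -1]                  # Rayleigh quotient with q = e_n
        Q, R = np.linalg.qr(Ak - mu * np.eye(n))
        Ak = R @ Q + mu * np.eye(n)      # similar to Ak, same eigenvalues
    return Ak
```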

QR with Wilkinson shift (15 pts)

To solve the problem that appears in the last example, we can use the Wilkinson shift:

$$\mu = a_m - \frac{\operatorname{sign}(\delta)\, b^2_{m-1}}{|\delta| + \sqrt{\delta^2 + b^2_{m-1}}},$$

where $\delta = \frac{a_{m-1} - a_m}{2}$. If $\delta = 0$, then instead of $\operatorname{sign}(\delta)$ you may choose $1$ or $-1$ arbitrarily. The numbers $a_m, b_{m-1}, a_{m-1}$ are taken from the matrix $B$:

$$ B = \begin{bmatrix} a_{m-1} & b_{m-1} \\ b_{m-1} & a_m \\ \end{bmatrix}, $$


which is the lower-right $2\times 2$ submatrix of $A^{(k)}$. Here $k$ is the iteration counter in the QR algorithm.
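A direct transcription of the formula (the function name is ours; it assumes $A^{(k)}$ is symmetric tridiagonal, so $b_{m-1}$ sits at both off-diagonal corners of $B$):

```python
import numpy as np

def wilkinson_shift(Ak):
    """Wilkinson shift from the trailing 2x2 submatrix of A^(k) (a sketch
    following the formula above)."""
    a_prev, a_last = Ak[-2, -2], Ak[-1, -1]
    b = Ak[-1, -2]                                  # b_{m-1}
    delta = (a_prev - a_last) / 2
    sign = np.sign(delta) if delta != 0 else 1.0    # either sign works at 0
    return a_last - sign * b**2 / (abs(delta) + np.sqrt(delta**2 + b**2))
```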

Problem 4. (Movie Recommender system) 15 pts

Imagine a world without NLA, where you have free evenings and can watch movies!
But it is always hard to choose a movie to watch. In this problem we suggest you build your own movie recommender system based on the SVD, so you can combine two perfect things: Numerical Linear Algebra and cinematography!

To build a recommender system you need data. Here you are: https://grouplens.org/datasets/movielens/1m/

Recommender systems can usually be divided into two groups:

Collaborative filtering.

This approach is based on user-item interactions. It makes one important assumption: a user who has liked an item in the past will also like similar items in the future. Suppose user A likes films about vampires: he is a Twilight saga fan, and he has watched the film "What We Do in the Shadows" and liked or disliked it; in other words, he rated it somehow. Now suppose another user B behaves similarly to user A (he is also a Twilight saga fan). Then the chance that he will rate "What We Do in the Shadows" the same way user A did is high. So the purpose of collaborative filtering is to predict a user's behavior based on the behavior of similar users.

Content-based filtering.

Collaborative filtering has some essential flaws. The main one is called the "cold start": it happens when a new user arrives and has not interacted with any items yet, so we do not know his past behavior and do not know what to recommend. Here content-based filtering helps. Resources often gather some extra information about users and items before a user even starts using the service. So, for example, we might know that a user likes horror movies before he has watched anything on the resource.

1) (1 pt) Explore the data. Construct the interaction matrix $M$ of size $m \times n$, which contains the information about how a certain user rated a certain film.
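One possible starting point, assuming the standard MovieLens 1M layout (ratings.dat with `::`-separated fields); the file path and column names are ours:

```python
import pandas as pd
from scipy.sparse import csr_matrix

# MovieLens 1M stores ratings as UserID::MovieID::Rating::Timestamp
ratings = pd.read_csv('ml-1m/ratings.dat', sep='::', engine='python',
                      names=['user', 'movie', 'rating', 'timestamp'])
# ids are 1-based; the shape is inferred from the largest ids
M = csr_matrix((ratings['rating'].to_numpy(dtype=float),
                (ratings['user'].to_numpy() - 1,
                 ratings['movie'].to_numpy() - 1)))
print(M.shape, M.nnz)
```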

2) (5 pts) Compute the SVD of this matrix. Remember that the matrix $M$ is sparse (one user can hardly watch all the movies), so a good choice would be a method from the scipy.sparse.linalg package

$$ M = USV^{\top}, $$

where $U$ is an $m \times r$ matrix with orthonormal columns (left singular vectors), which represents the relationship between users and latent factors, $S$ is an $r \times r$ diagonal matrix, which describes the strength of each latent factor, and $V^\top$ is an $r \times n$ matrix with right singular vectors, which represent the embeddings of items (movies in our case) in the latent space. Describe any simple heuristic for choosing an appropriate value of $r$ and explain why you expect it to work.
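One possible call, using `scipy.sparse.linalg.svds` on the interaction matrix `M` from step 1 (the value of `r` is our placeholder):

```python
from scipy.sparse.linalg import svds

r = 50                      # number of latent factors; tune via your heuristic
U, s, Vt = svds(M, k=r)     # M is the sparse interaction matrix from step 1
# svds returns singular values in ascending order; flip to the usual convention
U, s, Vt = U[:, ::-1], s[::-1], Vt[::-1]
```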

3) (2 pts) To get weighted item latent factors, we can multiply $S$ and $V^{\top}$. Please remember that $S$ is diagonal and multiply them efficiently.
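For instance, instead of forming the dense diagonal matrix (`s` and `Vt` as returned in the previous step; `weighted_items` is our name):

```python
import numpy as np

# np.diag(s) @ Vt would build an r x r dense matrix and cost O(r^2 n);
# broadcasting scales the rows of V^T directly in O(r n)
weighted_items = s[:, np.newaxis] * Vt      # shape (r, n)
```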

Now we have vectors that represent our item space. In other words, we have $n$ movies and $n$ vectors that describe each movie, a.k.a. embeddings. To decide whether two movies are similar, we just need to check whether the corresponding vectors are similar. How can we do this?

4) (2 pts) Implement the cosine metric. If the cosine metric between two vectors equals $1$, the vectors are collinear; if it equals $0$, the vectors are orthogonal and, as a result, the corresponding movies are completely different.

$$ \operatorname{cosine}(u,v) = \frac{u^{\top}v}{\|u\|_2\|v\|_2} $$
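A direct implementation might be:

```python
import numpy as np

def cosine(u, v):
    """Cosine metric between two embedding vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```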

5) (5 pts) Check your result. Implement a function that finds and prints the $k$ movies most similar to the one you have chosen.
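A minimal sketch; the function name is ours, `embeddings` is the item matrix from step 3 (one column per movie), and `titles` is a hypothetical array mapping a column index to a movie title:

```python
import numpy as np

def print_similar_movies(movie_idx, embeddings, titles, k=10):
    """Print the k movies most similar (by cosine metric) to column movie_idx.
    `embeddings` holds one column per movie (e.g. weighted_items from step 3),
    `titles` is a hypothetical index -> title array."""
    e = embeddings[:, movie_idx]
    sims = (embeddings.T @ e) / (np.linalg.norm(embeddings, axis=0)
                                 * np.linalg.norm(e))
    for j in np.argsort(sims)[::-1][1:k + 1]:   # position 0 is the movie itself
        print(titles[j], sims[j])
```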

Enjoy watching the recommended movies!