The complexity of computing the LU decomposition of a dense $n\times n$ matrix is $\mathcal{O}(n^3)$.
Significant reduction in complexity can be achieved if the matrix has a certain structure, e.g. it is sparse.
In the following task we consider an important example of LU decomposition for a special type of matrices: band matrices with top-left entry equal to 1 and bandwidth $m$ equal to 3 or 5, which are called tridiagonal and pentadiagonal respectively. The bands may be [1, 2, 1] and [1, 1, 2, 1, 1] respectively.
Write a function band_lu(diag_broadcast, n) which computes the LU decomposition of a tridiagonal or pentadiagonal matrix with top-left entry equal to 1 and the given diagonal bands. For example, the input parameters (diag_broadcast = [1,2,1], n = 4) mean that we need to find the LU decomposition of the tridiagonal matrix of the form
$$\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 2 & 1 & 0 \\ 0 & 1 & 2 & 1 \\ 0 & 0 & 1 & 2 \end{pmatrix}.$$
Provide extensive testing of the implemented function to demonstrate that it works correctly for large $n$, e.g. $n=100$.
As output, return L and U - 2D arrays that store the diagonals of the factors $L$ and $U$: L[0] keeps the first lower diagonal of $L$, L[1] keeps the second lower diagonal, and so on; U[:,0] keeps the main diagonal of $U$, U[:,1] keeps the first upper diagonal, and so on.
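A possible correctness check for the tridiagonal case is sketched below (a minimal sketch: it assumes the band_lu stub further down is implemented and follows the storage convention above; the way the dense factors are reassembled from the stored diagonals is my assumption):

import numpy as np
from scipy.sparse import diags

n = 100
A = diags([1, 2, 1], offsets=[-1, 0, 1], shape=(n, n)).toarray()
A[0, 0] = 1                                    # top-left entry equal to 1
L_band, U_band = band_lu([1, 2, 1], n)

# reassemble dense factors from the stored diagonals (tridiagonal case)
L_full = np.eye(n) + np.diag(L_band[0], k=-1)                    # unit main diagonal is not stored
U_full = np.diag(U_band[:, 0]) + np.diag(U_band[:n - 1, 1], k=1)
print(np.linalg.norm(L_full @ U_full - A))     # should be close to machine precision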
(2 pts) Compare the execution time of the standard LU decomposition from scipy, i.e. a function which takes the whole matrix and does not know about its special structure, with the band decomposition of your implementation. Comment on the results.
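A possible timing setup (a sketch; it assumes the band_lu function from the stub below has been implemented and uses scipy.linalg.lu as the dense baseline):

import numpy as np
import scipy.linalg
from scipy.sparse import diags

n = 2000
A = diags([1, 2, 1], offsets=[-1, 0, 1], shape=(n, n)).toarray()
A[0, 0] = 1

%timeit scipy.linalg.lu(A)       # dense LU, O(n^3)
%timeit band_lu([1, 2, 1], n)    # band LU, O(n) for a fixed bandwidth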
(7 pts) Write a function cholesky(A)
for computing the Cholesky decomposition. It should take a single argument, the matrix to be factorized, and return a single output, the lower-triangular factor $L$. Think about the efficiency of your implementation and, if necessary, update it to achieve the best performance (eliminate Python loops where possible, and so on). Explicitly describe the difference from LU decomposition that reduces the complexity from $2n^3/3$ for LU to $n^3/3$ for Cholesky.
Test the implemented function on the Pascal matrix of size $n$ for $n = 5, 10, 50$.
The Pascal matrix is a square matrix of the following form (here for $n=4$):
$$P = \begin{pmatrix}
1 & 1 & 1 & 1\\
1 & 2 & 3 & 4 \\
1 & 3 & 6 & 10 \\
1 & 4 & 10 & 20 \\
\end{pmatrix}.$$
Here you can find more details about such matrices and the analytical form of the factor $L$ in the Cholesky decomposition. Compare the result of your implementation with the analytical expression in terms of some matrix norm of the difference.
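One possible check is sketched below (it assumes the cholesky function defined further down and uses the known fact that the lower-triangular Pascal matrix is the Cholesky factor of the symmetric one):

import numpy as np
from scipy.linalg import pascal

n = 10
P = pascal(n, kind='symmetric', exact=False)    # symmetric Pascal matrix
L_exact = pascal(n, kind='lower', exact=False)  # analytical lower-triangular factor
L_num = cholesky(P)                             # your implementation
print(np.linalg.norm(L_num - L_exact, ord='fro'))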
from scipy.sparse import diags # can be used with broadcasting of scalars if desired dimensions are large
import numpy as np
# INPUT : diag_broadcast - list of diagonal values to broadcast, length equal to 3 or 5; n - integer, band matrix shape.
# OUTPUT : L - 2D np.ndarray, L.shape[0] depends on the bandwidth, L.shape[1] = n-1; the main diagonal (all ones) is not stored;
#              rows are padded with zeros on the right to handle the varying diagonal lengths.
#          U - 2D np.ndarray, U.shape[0] = n, U.shape[1] depends on the bandwidth;
#              columns are padded with zeros at the bottom to handle the varying diagonal lengths.
def band_lu(diag_broadcast, n):
    # enter your code here
    raise NotImplementedError()
def cholesky(A):
    # enter your code here
    raise NotImplementedError()
# Your solution is here
$ A = \begin{pmatrix} 0 & 1 \\ 2 & 3 \end{pmatrix}.$
$B = \begin{pmatrix} 1 & 1 & 0\\ 1 & 1 & 2 \\ 1 & 2 & 1 \end{pmatrix}.$
$A = \begin{pmatrix} 1 & c & 0\\ 2 & 4 & 1 \\ 3 & 5 & 1 \end{pmatrix}.$
# Your solution is here
As you have noticed before, LU decomposition may fail. In order to make it stable, we can use LU decomposition with pivoting (PLU).
We want to find a permutation matrix $P$ such that the LU decomposition of $PA$ exists:
$$ PA = LU. $$
(7 pts) Implement PLU decomposition efficiently (without loops and with an appropriate level of BLAS operations). Also, pay attention to how you store the permutation matrix; a compact storage scheme is illustrated below.
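A common convention (an illustration only, not a required format) is to keep the permutation as an index vector p instead of a dense matrix, so that $PA$ is just a row reindexing of $A$:

import numpy as np

A = np.array([[0., 1.],
              [2., 3.]])    # e.g. the first matrix above, whose (1,1) entry is zero
p = np.array([1, 0])        # permutation stored as an index vector: row i of PA is row p[i] of A
PA = A[p, :]                # equivalent to P @ A with the dense P = np.eye(2)[p]
print(np.allclose(PA, np.eye(2)[p] @ A))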
(4 pts) Compare your function for computing PLU with the built-in function on matrices of the type above (mirror_diag = [1,2,1], n = 4); the bandwidth and matrix size may vary. You can pass them as dense 2D NumPy arrays and do not need to tune your implementation to this special structure. Compare the two in terms of running time (use the %timeit magic) for a range of dimensions to recover the asymptotic rate of time growth, and in terms of accuracy. We expect a plot of running time vs. matrix dimension for the built-in function and your implementation, i.e. a plot with two lines.
Consider additionally one of the pathological examples from above, where LU fails, but PLU has to work.
NumPy or JAX are both ok in this problem, but please use a single library for all implementations.
# Your solution is here
Let $A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$ be a block matrix. The goal is to solve the linear system
$$ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} = \begin{bmatrix} f_1 \\ f_2 \end{bmatrix}, $$
where $X$ is a nonsingular square matrix.
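For reference, the standard block-elimination route (a sketch, assuming the block $A_{11}$ is nonsingular) solves for $u_2$ through the Schur complement and then back-substitutes for $u_1$:
$$ \left(A_{22} - A_{21}A_{11}^{-1}A_{12}\right) u_2 = f_2 - A_{21}A_{11}^{-1} f_1, \qquad u_1 = A_{11}^{-1}\left(f_1 - A_{12} u_2\right). $$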
# Your solution is here
(2 pts) Prove that eigenvectors that correspond to distinct eigenvalues are linearly independent.
(3 pts) $A$ is a matrix such that $a_{i,j} \ge 0$ and $\sum_{j}a_{i,j} = 1$ (the sum of the elements in each row is 1). Prove that $A$ has an eigenvalue $\lambda=1$ and that every eigenvalue $\lambda_i$ satisfies $|\lambda_i| \le 1$.
(5 pts) Prove that a normal matrix is Hermitian iff its eigenvalues are real. Prove that a normal matrix is unitary iff its eigenvalues satisfy $|\lambda| = 1$.
(5 pts) The following problem illustrates the instability of the Jordan form. Find theoretically the eigenvalues of the perturbed Jordan block (there is only one $\varepsilon$, in the lower-left corner):
$$ J(\varepsilon) = \begin{pmatrix} \lambda & 1 & & & \\ & \lambda & 1 & & \\ & & \ddots & \ddots & \\ & & & \lambda & 1 \\ \varepsilon & & & & \lambda \end{pmatrix}_{n\times n}. $$
Comment on how the eigenvalues of $J(0)$ are perturbed for large $n$.
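An optional numerical sanity check is sketched below (it does not replace the theoretical derivation, and the matrix built here is just my reconstruction of $J(\varepsilon)$):

import numpy as np

n, lam, eps = 8, 0.0, 1e-4
J = lam * np.eye(n) + np.diag(np.ones(n - 1), k=1)   # Jordan block J(0)
J[-1, 0] = eps                                       # single perturbation in the lower-left corner
eigs = np.linalg.eigvals(J)
print(np.abs(eigs - lam))                            # compare with your theoretical prediction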
# Your solution is here
# INPUT: G - np.ndarray or sparse matrix
# OUTPUT: A - np.ndarray (of size G.shape) or sparse matrix
def pagerank_matrix(G):
    # enter your code here
    return A
(5 pts) Implement the power method for a given number of iterations num_iter. It should be organized as a function power_method(A, x0, num_iter) that outputs an approximation to the eigenvector $x$, the eigenvalue $\lambda$ and the history of residuals $\{\|Ax_k - \lambda_k x_k\|_2\}$. Make sure that the method converges to the correct solution on the matrix $\begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}$, which is known to have the largest eigenvalue equal to $3$.
# INPUT: A - np.ndarray (2D), x0 - np.ndarray (1D), num_iter - integer (positive)
# OUTPUT: x - np.ndarray (of size x0), l - float, res - np.ndarray (of size num_iter + 1 [includes initial guess])
def power_method(A, x0, num_iter): # 5 pts
    # enter your code here
    return x, l, res
Run the power method on the PageRank matrix $A$ with num_iter=100 and a random initial guess x0. Explain the absence of convergence. Then run it with num_iter=100 for 10 different initial guesses and print/plot the resulting approximate eigenvectors. Why do they depend on the initial guess?

In order to avoid this problem, Larry Page and Sergey Brin proposed to use the following regularization technique:
$$ A_d = dA + \frac{1-d}{N} \begin{pmatrix} 1 & \dots & 1 \\ \vdots & & \vdots \\ 1 & \dots & 1 \end{pmatrix}, $$
where $d$ is a small parameter in $[0,1]$ (typically $d=0.85$) called the damping factor, and $A$ is of size $N\times N$. Now $A_d$ is a matrix whose largest eigenvalue has multiplicity 1.

Recall that computing the eigenvector of the PageRank matrix corresponding to the largest eigenvalue has the following interpretation. Consider a person who starts in a random node of the graph (i.e. opens a random web page); at each step s/he follows one of the outgoing edges uniformly at random (i.e. opens one of the links). So the person randomly walks through the graph, and the eigenvector we are looking for is exactly his/her stationary distribution: for each node it gives the probability of visiting that particular node. Therefore, if the person starts in a part of the graph which is not connected to the rest, s/he will never get to the other part. In the regularized model, the person at each step follows one of the outgoing links with probability $d$ OR teleports to a random node of the whole graph with probability $(1-d)$.
(2 pts) Now, run the power method with $A_d$ and plot residuals $\|A_d x_k - \lambda_k x_k\|_2$ as a function of $k$ for $d=0.97$, num_iter=100 and a random initial guess x0.
(5 pts) Find the second-largest (in absolute value) eigenvalue of the obtained matrix $A_d$. How and why is it connected to the damping factor $d$? What is the convergence rate of the PageRank algorithm when the damping factor is used?
Usually, graphs that arise in various areas are sparse (social, web, road networks, etc.) and, thus, computation of a matrix-vector product for corresponding PageRank matrix $A$ is much cheaper than $\mathcal{O}(N^2)$. However, if $A_d$ is calculated directly, it becomes dense and, therefore, $\mathcal{O}(N^2)$ cost grows prohibitively large for big $N$.
(2 pts) Implement fast matrix-vector product for $A_d$ as a function pagerank_matvec(A, d, x)
, which takes a PageRank matrix $A$ (in sparse format, e.g., csr_matrix
), damping factor $d$ and a vector $x$ as an input and returns $A_dx$ as an output.
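Note that the rank-one correction acts on $x$ only through the sum of its entries, so $A_dx$ can be formed without ever building the dense $A_d$ (this identity follows directly from the definition of $A_d$ above):
$$ A_d x = d\,(Ax) + \frac{1-d}{N}\Big(\sum_{i=1}^{N} x_i\Big)\mathbf{1}, $$
where $\mathbf{1}$ is the vector of all ones.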
(1 pt) Generate a random adjacency matrix of size $10000 \times 10000$ with only 100 non-zero elements and compare the performance of pagerank_matvec with the direct evaluation of $A_dx$.
# INPUT: A - np.ndarray (2D), d - float (from 0.0 to 1.0), x - np.ndarray (1D, size of A.shape[0/1])
# OUTPUT: y - np.ndarray (1D, size of x)
def pagerank_matvec(A, d, x): # 2 pts
    # enter your code here
    return y
Download the dataset from here, unzip it and put dblp_authors.npz and dblp_graph.npz in the same folder with this notebook. Each value (author name) from dblp_authors.npz corresponds to the row/column of the matrix from dblp_graph.npz. The value at row i and column j of the matrix from dblp_graph.npz corresponds to the number of times author i cited papers of author j. Let us now find the most significant scientists according to the PageRank model over the DBLP data.
Load the citation graph and the list of authors with the load_dblp(...) function. Print its density (fraction of nonzero elements). Find the top-10 most cited authors from the weighted adjacency matrix. Now, make all the weights of the adjacency matrix equal to 1 for simplicity (consider only the existence of a connection between authors, not its weight). Obtain the PageRank matrix $A$ from the adjacency matrix and verify that it is stochastic.

In order to pass pagerank_matvec to your power_method (without rewriting it) for fast calculation of $A_dx$, you can create a LinearOperator: L = scipy.sparse.linalg.LinearOperator(A.shape, matvec=lambda x, A=A, d=d: pagerank_matvec(A, d, x)). Then L@x or L.dot(x) will result in a call to pagerank_matvec(A, d, x), and thus you can plug $L$ instead of the matrix $A$ into power_method directly. Note: though in the previous subtask the graph was very small (so you could disregard the fast matvec implementation), here it is very large (but sparse), so direct evaluation of $A_dx$ would require $\sim 10^{12}$ matrix elements to store - good luck with that (^_<).

from scipy.sparse import load_npz
import numpy as np
def load_dblp(path_auth, path_graph):
    G = load_npz(path_graph).astype(float)
    with np.load(path_auth) as data:
        authors = data['authors']
    return G, authors
G, authors = load_dblp('dblp_authors.npz', 'dblp_graph.npz')
# Your code is here
# INPUT:
# A_init - square matrix,
# num_iter - number of iterations for the QR algorithm
# OUTPUT:
# Ak - the matrix A_init transformed by the QR algorithm,
# convergence - numpy array of shape (num_iter, ),
#               where we store the Chebyshev norm (maximum absolute value)
#               of the triangular part of Ak at every iteration
def qr_algorithm(A_init, num_iter): # 3 pts
    # enter your code here
    return Ak, convergence
Visualize the resulting matrix with plt.spy(Ak, precision=1e-7).

# Your solution is here

Visualize the resulting matrix with plt.spy(Ak, precision=1e-7). Is this matrix lower triangular? How does this correspond to the claim about convergence of the QR algorithm?

# Your solution is here
In the lectures the Rayleigh quotient shift was introduced to speed up the convergence of the power method. Here we ask you to generalize this approach to construct the shifts in the QR algorithm.
def qr_algorithm_rayleigh(A_init, num_iter):
    # enter your code here
    return Ak, convergence
# Your solution is here
To solve the problem that appears in the last example, we can use the Wilkinson shift:
$$\mu = a_m - \frac{\operatorname{sign}(\delta)\, b^2_{m-1}}{|\delta| + \sqrt{\delta^2 + b^2_{m-1}}},$$
where $\delta = \frac{a_{m-1} - a_m}{2}$. If $\delta = 0$, then instead of $\operatorname{sign}(\delta)$ you have to choose $1$ or $-1$ arbitrarily. The numbers $a_m, b_{m-1}, a_{m-1}$ are taken from the matrix
$$ B = \begin{bmatrix} a_{m-1} & b_{m-1} \\ b_{m-1} & a_m \\ \end{bmatrix}, $$
which is the bottom-right $2\times 2$ submatrix of $A^{(k)}$. Here $k$ is the iteration counter of the QR algorithm. A small helper computing this shift is sketched below.
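A minimal sketch of the shift computation (the helper name wilkinson_shift is hypothetical; it simply evaluates the formula above on the trailing $2\times 2$ block):

import numpy as np

def wilkinson_shift(Ak):
    # a_{m-1}, b_{m-1}, a_m are taken from the bottom-right 2x2 submatrix of the current iterate
    a_prev, b, a_last = Ak[-2, -2], Ak[-1, -2], Ak[-1, -1]
    delta = (a_prev - a_last) / 2
    s = np.sign(delta) if delta != 0 else 1.0    # choose +1 arbitrarily when delta == 0
    return a_last - s * b**2 / (abs(delta) + np.sqrt(delta**2 + b**2))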
def qr_algorithm_wilkinson(A_init, num_iter):
    # enter your code here
    return Ak, convergence
# Your solution is here
Imagine the world without NLA where you have free evenings and you can watch movies!
But it is always hard to choose a movie to watch.
In this problem we suggest you to build your own movie recommender system based on SVD decomposition, so you can combine two perfect things: Numerical Linear Algebra and cinematography!
In order to build a recommender system you need data. Here you are: https://grouplens.org/datasets/movielens/1m/
Usually, recommender systems may be divided into two groups: collaborative filtering and content-based filtering.
The first approach, collaborative filtering, is based on user-item interactions. It has one important assumption: a user who has liked an item in the past will also like it in the future. Suppose user A likes films about vampires: he is a Twilight saga fan, and he has watched the film "What We Do in the Shadows" and liked or disliked it, in other words he rated it somehow. Suppose there is another user B whose behavior is similar to user A's (he is also a Twilight saga fan). Then the chance that he will rate "What We Do in the Shadows" the same way user A did is high. So, the purpose of collaborative filtering is to predict a user's behavior based on the behavior of similar users.

Collaborative filtering has some essential flaws. The main one is called the "cold start": it happens when a new user arrives who has not interacted with any items yet, so we do not know his past behavior and we do not know what to recommend. Here content-based filtering helps. Services often gather some extra information about users and items before a user even starts using them, so, for example, we would know that a user likes horror movies before he has watched anything on the service.
1) (1 pt) Explore the data. Construct the interaction matrix $M$ of size $m \times n$ which contains the information of how a certain user rated a certain film.

2) (5 pts) Compute the SVD of this matrix. Remember that the matrix $M$ is sparse (one user can hardly watch all the movies), so a good choice would be to use a method from the scipy.sparse.linalg package:
$$ M \approx U S V^\top, $$
where $U$ is an $m \times r$ orthogonal matrix with left singular vectors, which represents the relationship between users and latent factors, $S$ is an $r \times r$ diagonal matrix, which describes the strength of each latent factor, and $V^\top$ is an $r \times n$ matrix with right singular vectors, which represent the embeddings of items (movies in our case) in the latent space. Describe a simple heuristic to choose an appropriate value for $r$ and explain why you expect it to work.
# Importing Libraries
import numpy as np
import pandas as pd
# Read the dataset
# Create the interaction matrix
# Normalize the matrix
# Compute Singular Value Decomposition of interaction matrix. You can use built-in functions
U, S, V = svd(M) # update this line, it is just an example
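If $M$ is stored as a scipy sparse matrix, one option is the truncated SVD scipy.sparse.linalg.svds (a sketch; M and the rank r are assumed to be defined above, and note that svds returns the singular values in ascending order):

from scipy.sparse.linalg import svds

U, S, Vt = svds(M, k=r)                      # truncated SVD of a sparse matrix
U, S, Vt = U[:, ::-1], S[::-1], Vt[::-1, :]  # reorder to the usual descending convention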
3) (2 pts) In order to get weighted item-latent factors, we can multiply $S$ and $V^{\top}$. Please remember that $S$ is diagonal and multiply them efficiently.
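Since $S$ is diagonal, broadcasting avoids forming np.diag(S) explicitly (a one-line sketch; it assumes S is the 1D array of singular values and Vt is $V^\top$ from the previous step):

weighted_items = S[:, None] * Vt  # scales row i of Vt by S[i]; same result as np.diag(S) @ Vt, but cheaper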
# Your solution is here
Now we have vectors that represent our item space. In other words, we have $N$ movies and $N$ vectors which describe each movie, a.k.a. embeddings. To decide whether two movies are similar or not, we just need to check whether the corresponding vectors are similar or not. How can we do this?
4) (2 pts) Implement the cosine metric. If the cosine metric between two vectors equals $1$, the vectors are collinear; if it equals $0$, the vectors are orthogonal and, as a result, the corresponding movies are completely different.
$$ \mathrm{cosine}(u,v) = \frac{u^{\top}v}{\|u\|_2\|v\|_2} $$

# Your solution is here
5) (5 pts) Check your result. Implement a function which finds and prints the $k$ movies most similar to the one you have chosen.

# Your solution is here
Enjoy watching the recommended movies!