Problem Set 1 (96 points)

Important information

  1. We provide the signatures of the functions that you have to implement. Make sure you follow these signatures; otherwise your coding solutions will not be graded.

  2. Please submit a single Jupyter Notebook file in which only Python and Markdown/$\LaTeX$ are used. Hand-written solutions inserted as photos or in any other way are prohibited and will not be graded. If you have any questions about using Markdown, ask them!

Problem 1 (Theoretical tasks) (36 pts)

1.

2.

Hint: $\exp(S^{-1}AS) = \sum\limits_{k=0}^{\infty}\frac{(S^{-1}AS)^k}{k!} = S^{-1}\exp(A)S$; moreover, $\exp(D) = \mathrm{diag}(\exp(d_1), \exp(d_2), \ldots, \exp(d_n))$, where $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$ is the diagonal matrix with diagonal entries $d_1, d_2, \ldots, d_n$.
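A quick numerical sanity check of this identity (illustrative, not part of the task):

```python
# Verify exp(S^{-1} A S) = S^{-1} exp(A) S on a random example
# (all names here are illustrative).
import numpy as np
from scipy.linalg import expm

n = 5
rng = np.random.default_rng(0)
S = rng.standard_normal((n, n))   # a generic random S is invertible
A = rng.standard_normal((n, n))
S_inv = np.linalg.inv(S)

print(np.allclose(expm(S_inv @ A @ S), S_inv @ expm(A) @ S))  # True up to rounding
```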

3.

$$ A = \begin{bmatrix} 2\\ 3\\ \vdots\\ \sqrt{5n-1} \end{bmatrix} \begin{bmatrix} -1 & 1 & -1 & \ldots & (-1)^{n} \end{bmatrix} $$
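If you want to experiment, here is a minimal NumPy sketch of this matrix, assuming the $k$-th entry of the column factor is $\sqrt{5k - 1}$ (which matches the displayed entries $2$, $3$, and $\sqrt{5n-1}$):

```python
# Construct A as the outer product shown above and confirm it has rank one.
import numpy as np

n = 6
k = np.arange(1, n + 1)
u = np.sqrt(5 * k - 1)            # 2, 3, ..., sqrt(5n - 1)
v = (-1.0) ** k                   # -1, 1, ..., (-1)^n
A = np.outer(u, v)
print(np.linalg.matrix_rank(A))   # 1
```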

Problem 2 (Matrix calculus) (15 pts)

1. (11 pts) Consider the following function

$$ F(U, V) = \frac{1}{2}\|X - UDV\|_F^2, $$

where $X \in \mathbb{R}^{n \times n}$, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times n}$, $k < n$, and $D = \mathrm{diag}(d_1, \ldots, d_k)$ is a given diagonal matrix.

2. (4 pts) Derive analytical expressions for the gradient and the Hessian of the function $R$:

$$ R(x) = \frac{(Ax, x)}{(x, x)}, $$

where $A$ is a symmetric real matrix. You will learn later in the lectures why the gradient of this function is important in NLA.
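Once you have a candidate formula, an optional way to sanity-check it is to compare with automatic differentiation (the names below are illustrative):

```python
# Compare your analytical gradient of R(x) = (Ax, x)/(x, x) with jax.grad.
import jax
import jax.numpy as jnp
import numpy as np

n = 4
rng = np.random.default_rng(1)
B = rng.standard_normal((n, n))
A = jnp.asarray(B + B.T)          # symmetric real matrix

def R(x):
    return (A @ x) @ x / (x @ x)

x = jnp.asarray(rng.standard_normal(n))
print(jax.grad(R)(x))             # compare with your derived expression
```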

Problem 3. Compression of the fully-connected layers in a neural network with a simple architecture (30 pts)

In this problem we consider a neural network that classifies a dataset of images. Any neural network can be viewed as a composition of simple linear and non-linear functions. For example, a neural network with 3 layers can be represented as

$$f_3(f_2(f_1(x, w_1), w_2), w_3),$$

where $x$ is the input data (in our case, images) and $w_i, \; i = 1,\dots,3$, are the parameters to be trained.

We will study the compression potential of a neural network with a simple architecture: an alternating sequence of linear and non-linear functions.

The main task in this problem is to study how the compression of the fully-connected layers affects the test accuracy. Every fully-connected layer is represented by a linear function $AX + B$, where $X$ is the input matrix and $A, B$ are trainable matrices. The matrices $A$ in every layer are going to be compressed. The main result you should obtain is a plot of the test accuracy versus the total number of parameters in the neural network.
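For intuition on why compression saves parameters: a rank-$r$ truncated SVD of an $n \times n$ matrix stores $r(2n + 1)$ numbers instead of $n^2$, and by the Eckart-Young theorem it is the best rank-$r$ approximation. A minimal NumPy sketch with illustrative sizes:

```python
# Best rank-r approximation of A via the truncated SVD (Eckart-Young).
import numpy as np

n, r = 1000, 50
A = np.random.randn(n, n)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_r = (U[:, :r] * s[:r]) @ Vt[:r]                   # rank-r approximation
print(np.linalg.norm(A - A_r) / np.linalg.norm(A))  # relative error
print(r * (2 * n + 1), "parameters instead of", n * n)
```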

Zero step: install PyTorch
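A typical install command (check pytorch.org for the variant matching your OS and CUDA setup):

```python
!pip install torch torchvision
```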

First step: download CIFAR10 dataset
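One common way to fetch the dataset with torchvision (the batch size and normalization constants below are conventional choices, not prescribed by the assignment):

```python
# Download CIFAR10 and wrap it into data loaders.
import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

train_set = torchvision.datasets.CIFAR10(root='./data', train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root='./data', train=False,
                                        download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False)
```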

Check what images we are going to classify
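A short snippet to preview a few training images, reusing train_loader from the step above (the class names follow the standard CIFAR10 ordering):

```python
# Show the first eight images of a training batch with their labels.
import matplotlib.pyplot as plt

classes = ('plane', 'car', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')
images, labels = next(iter(train_loader))

fig, axes = plt.subplots(1, 8, figsize=(12, 2))
for ax, img, lbl in zip(axes, images, labels):
    ax.imshow((img / 2 + 0.5).permute(1, 2, 0).numpy())  # undo the normalization
    ax.set_title(classes[int(lbl)])
    ax.axis('off')
plt.show()
```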

Second step: neural network architecture

For simplicity, and to demonstrate the idea of neural network compression, consider an architecture consisting of only fully-connected layers with non-linear ReLU functions between them. To make the compression effect visible, set the dimension of the inner layers to 1000.

Below you see an implementation of such a neural network in PyTorch. You will study neural networks in more detail in the Deep Learning course in one of the upcoming terms.
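A minimal sketch matching this description (the number of hidden layers is an illustrative choice; CIFAR10 images are flattened to $3 \cdot 32 \cdot 32 = 3072$ inputs):

```python
# Fully-connected network: Linear layers with ReLU in between, hidden size 1000.
import torch.nn as nn

hidden = 1000
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, hidden),
    nn.ReLU(),
    nn.Linear(hidden, hidden),
    nn.ReLU(),
    nn.Linear(hidden, 10),   # 10 CIFAR10 classes
)
```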

Implement functions for training and testing after every sweep over all dataset entries
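One possible shape for these helpers (a sketch, not a prescribed interface):

```python
# One training sweep (epoch) over the data and an accuracy evaluation.
import torch
import torch.nn.functional as F

def train_epoch(model, loader, optimizer, device='cpu'):
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()        # loss on the last batch

def test_accuracy(model, loader, device='cpu'):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total
```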

Set parameters for training and print intermediate loss values

Third step: run training with the Adam optimization method

If your laptop is not very fast, training will take some time to finish.
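A hedged training loop with Adam, reusing the helpers sketched above; the number of epochs and the learning rate are illustrative choices:

```python
# Train with Adam and report progress after every epoch.
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    loss = train_epoch(model, train_loader, optimizer, device)
    acc = test_accuracy(model, test_loader, device)
    print(f'epoch {epoch}: last-batch loss {loss:.3f}, test accuracy {acc:.3f}')
```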

Now we have a trained neural network, and we are ready to compress the weights in the fully-connected layers.
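One way to compress a layer is to replace its weight matrix by two thinner linear maps built from the truncated SVD; the helper below is a sketch with an assumed interface, not the required solution. Applying it to every hidden layer for a range of ranks $r$, re-measuring the test accuracy, and recording the parameter counts gives the requested plot.

```python
# Replace a fully-connected layer by two thinner ones whose product is the
# rank-r truncated SVD of the original weight matrix.
import torch
import torch.nn as nn

def compress_linear(layer, r):
    U, s, Vt = torch.linalg.svd(layer.weight.data, full_matrices=False)
    first = nn.Linear(layer.in_features, r, bias=False)
    second = nn.Linear(r, layer.out_features, bias=layer.bias is not None)
    first.weight.data = torch.diag(s[:r]) @ Vt[:r, :]  # r x in_features
    second.weight.data = U[:, :r]                      # out_features x r
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)
```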

Problem 4. «Reducio!» (15 pts)

Hermione Granger is well versed in all magical disciplines. In particular, she is a great expert in Numerical Linear Algebra and JAX.

She has invited Harry Potter to play a game.

Hermione chooses a number $r \in [1, 95]$ and two matrices $W_1 \in \mathbb{R}^{r \times 100}$ and $W_2 \in \mathbb{R}^{100 \times r}$. Harry can tell her any 100-dimensional vector $x$, and Hermione gives Harry the result of

$$ \|\sin(W_2 \cos(W_1 x))\|^2_2, $$

where trigonometric functions are applied element-wise. The result is the squared norm of a 100-dimensional vector.

To avoid losing, Harry should guess which number $r$ Hermione has chosen. Harry knows the Python language, but he is an absolute layman in algebra. Please help him get at least 95% of the answers correct!

Hint 1: SVD might help you, but use it wisely!

Hint 2: Suppose that a special magic allows you to compute gradients through automatic differentiation in JAX. You can also estimate gradients via finite differences, if you prefer.

Hint 3: You can estimate the matrix rank using simple heuristics.

You should write your code in the harry_answers function. Good luck!
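A possible skeleton suggested by the hints (hedged: oracle stands for Hermione's black-box answer, the interface is an assumption, and the threshold is a heuristic you will likely need to tune). The key observation is that the gradient of the scalar output with respect to $x$ always lies in the row space of $W_1$, whose dimension is $r$; stacking estimated gradients at many random points and counting the numerically nonzero singular values therefore estimates $r$:

```python
# Estimate r from gradients of the black-box scalar function (sketch).
# oracle(x) is assumed to return ||sin(W2 cos(W1 x))||_2^2 for a 100-vector x.
import numpy as np

def harry_answers(oracle, n_probes=150, eps=1e-4, rel_tol=1e-2):
    rng = np.random.default_rng(0)
    grads = []
    for _ in range(n_probes):
        x = rng.standard_normal(100)
        g = np.empty(100)
        for i in range(100):              # central finite differences
            e = np.zeros(100)
            e[i] = eps
            g[i] = (oracle(x + e) - oracle(x - e)) / (2 * eps)
        grads.append(g)
    # each gradient lies in the r-dimensional row space of W1, so the
    # stack of gradients generically has rank r
    s = np.linalg.svd(np.stack(grads), compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))
```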