We provide the signatures of the functions that you have to implement. Make sure you follow these signatures; otherwise, your coding solutions will not be graded.
Many daily-life events can be described using time series: $x_t$, $t \in \{0, \ldots, n\}$, e.g. closing price of a stock on day $t$ or number of users on a website in hour $t$.
Below you are given an example time series $x \in \mathbb{R}^n$:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('https://nla.skoltech.ru/homeworks/files/data.csv')
x = data['Val'].values
plt.plot(x)
Raw time-dependent data usually looks rather messy. However, it can often be viewed as a sum of some trend and some random noise. For example, a periodic trend is easily observed in the series above.
1. Your first task is to find a truly periodic series that best approximates the original data $x$.
Let $z \in \mathbb{R}^m$ denote one period of some given periodic data $x_{per} \in \mathbb{R}^n$ (assuming $n$ is a multiple of $m$), i.e.:
$$x_{per} = (z, z, \ldots, z),$$

or, in matrix form,

$$Az = x_{per},$$

with a "tall" matrix $A \in \mathbb{R}^{n\times m}$, $n > m$.
The actual data $x$ is non-periodic, hence a solution (in an appropriate sense) of the following overdetermined linear system is required (think why):
$$Az = x.$$

Construct the matrix $A$ in the function below; the numpy.vstack() function may be useful here. Describe how you choose $m$. Also, suggest what kind of data you are working with.

def build_A(n, m):
'''
Input: positive integers n, m
Output: np.array of size (n, m)
'''
# Pure loops are strongly prohibited!
# Comprehension expressions are possible.
raise NotImplementedError()
A = build_A(x.shape[0], m)
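For reference, a minimal sketch of one way build_A could look (an illustration, not the graded solution): since $Az$ must repeat $z$ exactly $n/m$ times, $A$ can be assembled by stacking $n/m$ identity blocks with numpy.vstack.

import numpy as np

def build_A(n, m):
    # Sketch: n is assumed to be a multiple of m; stacking n // m copies of the
    # m x m identity gives A @ z = (z, z, ..., z)
    return np.vstack([np.eye(m)] * (n // m))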
def build_pseudoinverse(A):
'''
Input: np.array of size (n, m)
Output: np.array of size (m, n)
'''
raise NotImplementedError()
A_inv = build_pseudoinverse(A)
z = A_inv @ x
# Place for your plots
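For reference, a possible sketch of build_pseudoinverse, assuming $A$ has full column rank (which holds for the stacked-identity construction): the Moore-Penrose pseudoinverse is then $A^+ = (A^\top A)^{-1} A^\top$.

import numpy as np

def build_pseudoinverse(A):
    # Sketch: A^+ = (A^T A)^{-1} A^T for a tall matrix with full column rank;
    # np.linalg.pinv(A) is a more general alternative
    return np.linalg.solve(A.T @ A, A.T)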
If your solution is correct, you will see that the obtained periodic approximation is non-smooth.
To ensure smoothness of the periodic approximation, consider the following regularized least-squares problem:

$$L(z) = \|Az - x\|^2 + \lambda R(z) \to \min_z,$$

where $R(z)$ is a regularization term and $\lambda > 0$. The term $R(z)$ reflects our additional assumptions on the smoothness of the periodic approximation $z$. One possible form of this term is:

$$R(z) = \sum_{i=1}^m (z_{i+1} - z_i)^2.$$

The same regularization term can also be written as:

$$R(z) = \|Dz\|_2^2.$$

Intuitively, the regularization term penalizes "sharp" series, and the parameter $\lambda$ provides a trade-off between approximation quality (the first term) and the degree of smoothness (the second term). Note that the case $\lambda = 0$ corresponds to the previously obtained solution $z^*$ (why?).
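For reference (a sketch of the derivation asked for below, in the notation above): setting the gradient of $L(z)$ to zero yields a square linear system,

$$\nabla L(z) = 2A^\top(Az - x) + 2\lambda D^\top D z = 0 \quad\Longleftrightarrow\quad (A^\top A + \lambda D^\top D)\, z = A^\top x,$$

whose solution is the smooth periodic approximation $z^*_s$.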
2. Your second task is to find the best smooth periodic approximation $z^*_s$ of the original data $x$.
def build_D(m):
'''
Input: positive integer m
Output: np.array of size (m, m)
'''
raise NotImplementedError()
D = build_D(z.shape[0])
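For reference, a possible sketch of build_D, under the cyclic reading of the $m$-term sum above (i.e. $z_{m+1}$ identified with $z_1$), so that $(Dz)_i = z_{i+1} - z_i$:

import numpy as np

def build_D(m):
    # Sketch of a cyclic first-difference matrix: (D z)_i = z_{i+1} - z_i, with z_{m+1} := z_1,
    # which matches the required (m, m) output shape
    return np.roll(np.eye(m), 1, axis=1) - np.eye(m)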
# Your derivation is here!
lmbda = 1. # You are supposed to play with parameter $\lambda$ during your study
# Construct system
# z_s = np.linalg.solve(matrix, rhs)
# Your solution is here
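A sketch of how the system could be assembled from the normal equations above (using A, D, x and lmbda defined in the previous cells):

matrix = A.T @ A + lmbda * (D.T @ D)   # (A^T A + lambda D^T D) z_s = A^T x
rhs = A.T @ x
z_s = np.linalg.solve(matrix, rhs)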
3. The final step is to build a model for predicting time series.
- We want to predict $x_{k+1}$ based on $M$ previous observations.
- Consider the following (linear) auto-regression (AR) model:
$$\hat{x}_{k+1} = \theta_1 x_k + \theta_2 x_{k-1} + \ldots + \theta_M x_{k-M+1},$$

with $M$ parameters $\theta_1, \ldots, \theta_M$ to be determined.
- We fit the parameters by minimizing the squared norm of the residual (again!):
$$(\hat{x}_{M+1} - x_{M+1})^2 + \ldots + (\hat{x}_n - x_n)^2 \to \min_{\theta}.$$

- This problem can be reformulated in a familiar form:
$$\|X\theta - x\|_2^2 \to \min_{\theta}.$$

Construct the matrix $X$ in the function below; a NumPy/SciPy stack function may be useful here.

def build_X(x, M):
'''
Input: np.array of size n, positive integer M
Output: np.array of size (n - M, M)
'''
raise NotImplementedError()
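For reference, a possible sketch of build_X; the ordering of the lags within a row is a convention chosen here (any consistent ordering works, since it only permutes the entries of $\theta$):

import numpy as np

def build_X(x, M):
    # Sketch: row j holds the M observations preceding x[j + M], newest first,
    # so that (X @ theta)[j] approximates x[j + M]
    n = x.shape[0]
    return np.stack([x[j:j + M][::-1] for j in range(n - M)])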
Note that the previously obtained smooth periodic approximation $z_s$ can itself be regarded as a predictive model. However, we also want to include the noise term in the model, fitting it with an AR model:
x_n = x - A @ z_s # Extract noise
train_x = x_n[:x_n.shape[0] // 2] # Train data to fit the model
test_x = x_n[x_n.shape[0] // 2:] # Test data to validate the model
M = 3 # You are supposed to find the best parameter $M$ during your study
X_train = build_X(train_x, M)
Build the pseudoinverse via the QR decomposition in the function below, using the numpy.linalg.qr() function.

def build_pseudoinverse_QR(A):
'''
Input: np.array of size (n - M, M)
Output: np.array of size (M, n - M)
'''
raise NotImplementedError()
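A possible sketch, assuming the input is a tall full-column-rank matrix: from the reduced QR decomposition $A = QR$ one gets $A^+ = R^{-1}Q^\top$, computed with a solve rather than an explicit inverse.

import numpy as np

def build_pseudoinverse_QR(A):
    # Sketch: A = Q R (reduced), so A^+ = R^{-1} Q^T;
    # scipy.linalg.solve_triangular(R, Q.T) would exploit the triangular structure of R
    Q, R = np.linalg.qr(A)
    return np.linalg.solve(R, Q.T)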
X_inv = build_pseudoinverse_QR(X_train)
theta = X_inv @ train_x[M:]
test_X = build_X(test_x, M)
predicted = test_X @ theta
# Plotting
# Your solution is here
1.
2.
# Your solution is here
1. (5 pts) Consider the following function
$$ F(U, V) = \frac{1}{2}\|X - UV\|_F^2, $$

where $X \in \mathbb{R}^{n \times n}$, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times n}$, and $k < n$.
2. (2 pts) Derive an analytical expression for the gradient of the function

$$ R(x) = \frac{(Ax, x)}{(x, x)}, $$

where $A$ is a symmetric real matrix. You will learn later in the lectures why the gradient of this function is important in NLA.
3. (8 pts) Consider the following function $f$
$$f(w) = \log\det\left(\sum_{i=1}^m w_i x_i x_i^{\top}\right),$$

where $x_i, \; i = 1,\dots,m$, are given column vectors.
The einsum function may be useful here.

# Your solution is here
In this problem we consider a neural network that classifies a dataset of images. Any neural network can be viewed as a composition of simple linear and non-linear functions. For example, a neural network with 3 layers can be represented as

$$f_3(f_2(f_1(x, w_1), w_2), w_3),$$

where $x$ is the input data (in our case, images) and $w_i, \; i = 1,\dots,3$, are parameters that are going to be trained.
We will study the compression potential of a neural network with a simple architecture: an alternation of linear and non-linear functions.

The main task in this problem is to study how the compression of fully-connected layers affects the test accuracy. A fully-connected layer is represented by the linear function $AX + B$, where $X$ is the input matrix and $A, B$ are trainable matrices. The matrices $A$ in every layer are going to be compressed. The main result you should obtain is a plot of test accuracy versus the total number of parameters in the neural network.
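For orientation (a standard parameter count, not specific to this assignment): if a weight matrix $A \in \mathbb{R}^{p \times q}$ is replaced by its rank-$r$ truncated SVD $A \approx U_r \Sigma_r V_r^{\top}$, the layer stores about $r(p + q + 1)$ numbers instead of $pq$, so the compression reduces the parameter count whenever $r < \frac{pq}{p + q + 1}$.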
import torch
import torchvision
from torchvision import datasets, transforms
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
batch_size = 100
transform = transforms.Compose(
[transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_loader = torch.utils.data.DataLoader(datasets.CIFAR10('./', train=True, download=True, transform=transform),
batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(datasets.CIFAR10('./', train=False, transform=transform),
batch_size=batch_size, shuffle=True)
classes = ('plane', 'car', 'bird', 'cat',
'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
def imshow(img):
img = img / 2 + 0.5 # unnormalize
npimg = img.numpy()
plt.figure(figsize=(20, 10))
plt.imshow(np.transpose(npimg, (1, 2, 0)))
plt.show()
# get some random training images
dataiter = iter(train_loader)
images, labels = next(dataiter)
# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(8)))
For simplicity, and to demonstrate the idea of neural network compression, consider an architecture consisting only of fully-connected layers with ReLU non-linearities between them. To make the compression effect visible, the dimension of the inner layers is set to 1000.
Below you see an implementation of such a neural network in PyTorch. You will study neural networks in more detail in the Deep Learning course in one of the upcoming terms.
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.fc1 = nn.Linear(3 * 32 * 32, 1000)
self.fc2 = nn.Linear(1000, 1000)
self.fc3 = nn.Linear(1000, 1000)
self.fc4 = nn.Linear(1000, 1000)
self.fc5 = nn.Linear(1000, 1000)
self.fc6 = nn.Linear(1000, 10)
self.ReLU = nn.ReLU()
def forward(self, x):
x = self.fc1(x.view(-1, 3 * 32*32))
x = self.ReLU(x)
x = self.fc2(x)
x = self.ReLU(x)
x = self.fc3(x)
x = self.ReLU(x)
x = self.fc4(x)
x = self.ReLU(x)
x = self.fc5(x)
x = self.ReLU(x)
x = self.fc6(x)
return F.log_softmax(x, dim=1)
def train(model, train_loader, optimizer, epoch):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()
optimizer.step()
if batch_idx % log_interval == 0:
print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
epoch, batch_idx * len(data), len(train_loader.dataset),
100. * batch_idx / len(train_loader), loss.item()))
def test(model, test_loader):
model.eval()
test_loss = 0
correct = 0
with torch.no_grad():
for data, target in test_loader:
# data, target = data.to(device), target.to(device)
output = model(data)
test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss
pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability
correct += pred.eq(target.view_as(pred)).sum().item()
test_loss /= len(test_loader.dataset)
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
test_loss, correct, len(test_loader.dataset),
100. * correct / len(test_loader.dataset)))
log_interval = 50
epochs = 7
model = Net()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(1, epochs + 1):
train(model, train_loader, optimizer, epoch)
test(model, test_loader)
Now we have a trained neural network and we are ready to compress the weights in the fully-connected layers.
Implement a new model class that is similar to Net, but with some significant distinctions:

- It takes as input an instance of the class Net and a compression rank $r > 0$.
- It has to compress all matrices $A$ in the fully-connected layers with the SVD, using the first $r$ singular vectors and singular values.
- Pay attention to storing the compressed representation of the layers efficiently.
- The forward method of your new model has to use the compressed representation of the fully-connected layers. In all other aspects it has to reproduce the forward method of the original non-compressed model (number of layers, activations, loss function, etc.).

# Your solution is here
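A minimal sketch of such a compressed model (an illustration under the assumptions above; it relies on torch, nn and F imported earlier, on the layer names fc1-fc6 of the Net class, and keeps two thin factors per layer):

class CompressedNet(nn.Module):
    def __init__(self, net, r):
        super(CompressedNet, self).__init__()
        self.factors = nn.ParameterList()
        self.biases = nn.ParameterList()
        for layer in [net.fc1, net.fc2, net.fc3, net.fc4, net.fc5, net.fc6]:
            W = layer.weight.data                                  # shape (out, in)
            U, S, Vh = torch.linalg.svd(W, full_matrices=False)
            k = min(r, S.shape[0])
            # keep two thin factors of the rank-k approximation instead of the full matrix
            self.factors.append(nn.Parameter(U[:, :k] * S[:k]))    # (out, k)
            self.factors.append(nn.Parameter(Vh[:k, :]))           # (k, in)
            self.biases.append(nn.Parameter(layer.bias.data.clone()))
    def forward(self, x):
        x = x.view(-1, 3 * 32 * 32)
        n_layers = len(self.biases)
        for i in range(n_layers):
            US, Vh, b = self.factors[2 * i], self.factors[2 * i + 1], self.biases[i]
            # x @ (US @ Vh).T + b, computed without ever forming the full weight matrix
            x = (x @ Vh.t()) @ US.t() + b
            if i < n_layers - 1:                                   # ReLU after all but the last layer
                x = F.relu(x)
        return F.log_softmax(x, dim=1)
# Example usage (r is the rank to experiment with):
# compressed_model = CompressedNet(model, r=50)
# test(compressed_model, test_loader)
# n_params = sum(p.numel() for p in compressed_model.parameters())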
A norm is called absolute if $\|x\| = \|\lvert x \rvert\|$ holds for any vector $x$, where $x = (x_1,\dots,x_n)^T$ and $\lvert x \rvert = (\lvert x_1 \rvert,\dots, \lvert x_n \rvert)^T$. Give an example of a norm which is not absolute.
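One family worth checking (an illustrative hint, not part of the original statement): norms of the form $\|x\| := \|Bx\|_2$ with an invertible, non-diagonal $B$. For example, with $B = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$ and $x = (1, -1)^T$ one gets $\|x\| = 1$, while $\|\lvert x \rvert\| = \sqrt{5}$.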
Write a function ranks_HOSVD(A, eps) that calculates the Tucker ranks of a $d$-dimensional tensor $A$ using the Higher-Order SVD (HOSVD) algorithm, where eps is the relative accuracy in the Frobenius norm between the approximated and the initial tensors. Details can be found here on Figure 4.3.
def ranks_HOSVD(A, eps):
return r #r should be a tuple of ranks r = (r1, r2, ..., rd)
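A possible sketch under one common truncation strategy, splitting the error budget $\varepsilon^2 \|A\|_F^2$ evenly over the $d$ modes (other splits of the budget are equally valid):

import numpy as np

def ranks_HOSVD(A, eps):
    d = A.ndim
    budget = (eps ** 2) * (np.linalg.norm(A) ** 2) / d          # per-mode squared-error budget
    ranks = []
    for mode in range(d):
        # mode-k unfolding: put the current mode first and flatten the remaining ones
        unfolding = np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)
        s = np.linalg.svd(unfolding, compute_uv=False)
        # tail[r] = sum of squared singular values discarded if the first r are kept
        tail = np.concatenate([np.cumsum((s ** 2)[::-1])[::-1], [0.0]])
        r = int(np.argmax(tail <= budget))                       # smallest r meeting the budget
        ranks.append(max(r, 1))
    return tuple(ranks)                                          # r = (r1, r2, ..., rd)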