We provide the signatures of the functions that you have to implement. Make sure you follow these signatures; otherwise, your coding solutions will not be graded.
Many daily-life events can be described using time series: $x_t$, $t \in \{0, \ldots, n\}$, e.g. closing price of a stock on day $t$ or number of users on a website in hour $t$.
Below you are given an example time series $x \in \mathbb{R}^n$:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('https://nla.skoltech.ru/homeworks/files/data.csv')
x = data['Val'].values
plt.plot(x)
Raw time-dependent data usually looks rather messy. However, it can often be viewed as a sum of some trend and some random noise. For example, a periodic trend is easily observed in the series above.
1. Your first task is to find a truly periodic series that best approximates the original data $x$.
Let $z \in \mathbb{R}^m$ denote one period of some given periodic data $x_{per} \in \mathbb{R}^n$ (assuming $n$ is a multiple of $m$), i.e.:
$$x_{per} = (z, z, \ldots, z),$$

or, in matrix form,

$$Az = x_{per},$$

with a "tall" matrix $A \in \mathbb{R}^{n\times m}$, $n > m$.
The actual data $x$ is non-periodic, hence a solution (in an appropriate sense) of the following overdetermined linear system is required (think why):
$$Az = x.$$

Construct the matrix $A$ in the function below; the numpy.vstack() function may be useful here. Describe how you choose $m$. Also, suggest what kind of data you are working with.

def build_A(n, m):
'''
Input: positive integers n, m
Output: np.array of size (n, m)
'''
# Pure loops are strongly prohibited!
# Comprehension expressions are possible.
raise NotImplementedError()
A = build_A(x.shape[0], m)
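For reference, a minimal sketch of one way build_A could look (an illustration, not the graded solution): since $Az$ must repeat $z$ exactly $n/m$ times, $A$ can be assembled by stacking $n/m$ identity blocks with numpy.vstack.

import numpy as np

def build_A(n, m):
    # Sketch: n is assumed to be a multiple of m; stacking n // m copies of the
    # m x m identity gives A @ z = (z, z, ..., z)
    return np.vstack([np.eye(m)] * (n // m))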
def build_pseudoinverse(A):
'''
Input: np.array of size (n, m)
Output: np.array of size (m, n)
'''
raise NotImplementedError()
A_inv = build_pseudoinverse(A)
z = A_inv @ x
# Place for your plots
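For reference, a possible sketch of build_pseudoinverse, assuming $A$ has full column rank (which holds for the stacked-identity construction): the Moore-Penrose pseudoinverse is then $A^+ = (A^\top A)^{-1} A^\top$.

import numpy as np

def build_pseudoinverse(A):
    # Sketch: A^+ = (A^T A)^{-1} A^T for a tall matrix with full column rank;
    # np.linalg.pinv(A) is a more general alternative
    return np.linalg.solve(A.T @ A, A.T)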
If your solution is correct, you will see that the obtained periodic approximation is non-smooth.
To ensure smoothness of the periodic approximation, consider the following regularized least-squares problem:

$$L(z) = \|Az - x\|^2 + \lambda R(z) \to \min_z,$$

where $R(z)$ is a regularization term and $\lambda > 0$. The term $R(z)$ reflects our additional assumptions on the smoothness of the periodic approximation $z$. One possible form of this term is:

$$R(z) = \sum_{i=1}^m (z_{i+1} - z_i)^2.$$

The same regularization term can also be written as:

$$R(z) = \|Dz\|_2^2.$$

Intuitively, the regularization term penalizes "sharp" series, and the parameter $\lambda$ provides a trade-off between approximation quality (the first term) and the degree of smoothness (the second term). Note that the case $\lambda = 0$ corresponds to the previously obtained solution $z^*$ (why?).
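For reference (a sketch of the derivation asked for below, in the notation above): setting the gradient of $L(z)$ to zero yields a square linear system,

$$\nabla L(z) = 2A^\top(Az - x) + 2\lambda D^\top D z = 0 \quad\Longleftrightarrow\quad (A^\top A + \lambda D^\top D)\, z = A^\top x,$$

whose solution is the smooth periodic approximation $z^*_s$.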
2. Your second task is to find the best smooth periodic approximation $z^*_s$ of the original data $x$.
def build_D(m):
'''
Input: positive integer m
Output: np.array of size (m, m)
'''
raise NotImplementedError()
D = build_D(z.shape[0])
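For reference, a possible sketch of build_D, under the cyclic reading of the $m$-term sum above (i.e. $z_{m+1}$ identified with $z_1$), so that $(Dz)_i = z_{i+1} - z_i$:

import numpy as np

def build_D(m):
    # Sketch of a cyclic first-difference matrix: (D z)_i = z_{i+1} - z_i, with z_{m+1} := z_1,
    # which matches the required (m, m) output shape
    return np.roll(np.eye(m), 1, axis=1) - np.eye(m)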
# Your derivation is here!
lmbda = 1. # You are supposed to play with parameter $\lambda$ during your study
# Construct system
# z_s = np.linalg.solve(matrix, rhs)
# Your solution is here
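A sketch of how the system could be assembled from the normal equations above (using A, D, x and lmbda defined in the previous cells):

matrix = A.T @ A + lmbda * (D.T @ D)   # (A^T A + lambda D^T D) z_s = A^T x
rhs = A.T @ x
z_s = np.linalg.solve(matrix, rhs)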
3. The final step is to build a model for predicting time series.
- We want to predict $x_{k+1}$ based on $M$ previous observations.
- Consider the following (linear) auto-regression (AR) model:
$$\hat{x}_{k+1} = \theta_1 x_k + \theta_2 x_{k-1} + \ldots + \theta_M x_{k-M+1},$$

with $M$ parameters $\theta_1, \ldots, \theta_M$ to be determined.
- We fit the parameters by minimizing the squared norm of the residual (again!):
$$(\hat{x}_{M+1} - x_{M+1})^2 + \ldots + (\hat{x}_n - x_n)^2 \to \min_{\theta}.$$

- This problem can be reformulated in a familiar form:
$$\|X\theta - x\|_2^2 \to \min_{\theta}.$$

Construct the matrix $X$ in the function below; a NumPy/SciPy stack function may be useful here.

def build_X(x, M):
'''
Input: np.array of size n, positive integer M
Output: np.array of size (n - M, M)
'''
raise NotImplementedError()
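For reference, a possible sketch of build_X; the ordering of the lags within a row is a convention chosen here (any consistent ordering works, since it only permutes the entries of $\theta$):

import numpy as np

def build_X(x, M):
    # Sketch: row j holds the M observations preceding x[j + M], newest first,
    # so that (X @ theta)[j] approximates x[j + M]
    n = x.shape[0]
    return np.stack([x[j:j + M][::-1] for j in range(n - M)])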
Note that the previously obtained smooth periodic approximation $z_s$ can itself be regarded as a predictive model. However, we also want to include the noise term in the model, fitting it with an AR model:
x_n = x - A @ z_s # Extract noise
train_x = x_n[:x_n.shape[0] // 2] # Train data to fit the model
test_x = x_n[x_n.shape[0] // 2:] # Test data to validate the model
M = 3 # You are supposed to find the best parameter $M$ during your study
X_train = build_X(train_x, M)
Build the pseudoinverse via the QR decomposition in the function below, using the numpy.linalg.qr() function.

def build_pseudoinverse_QR(A):
'''
Input: np.array of size (n - M, M)
Output: np.array of size (M, n - M)
'''
raise NotImplementedError()
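A possible sketch, assuming the input is a tall full-column-rank matrix: from the reduced QR decomposition $A = QR$ one gets $A^+ = R^{-1}Q^\top$, computed with a solve rather than an explicit inverse.

import numpy as np

def build_pseudoinverse_QR(A):
    # Sketch: A = Q R (reduced), so A^+ = R^{-1} Q^T;
    # scipy.linalg.solve_triangular(R, Q.T) would exploit the triangular structure of R
    Q, R = np.linalg.qr(A)
    return np.linalg.solve(R, Q.T)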
X_inv = build_pseudoinverse_QR(X_train)
theta = X_inv @ train_x[M:]
test_X = build_X(test_x, M)
predicted = test_X @ theta
# Plotting
# Your solution is here
1.
2.
# Your solution is here
1. (5 pts) Consider the following function
$$ F(U, V) = \frac{1}{2}\|X - UV\|_F^2, $$

where $X \in \mathbb{R}^{n \times n}$, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times n}$, and $k < n$.
2. (2 pts) Derive an analytical expression for the gradient of the function

$$ R(x) = \frac{(Ax, x)}{(x, x)}, $$

where $A$ is a symmetric real matrix. You will learn later in the lectures why the gradient of this function is important in NLA.
3. (8 pts) Consider the following function $f$
$$f(w) = \log\det\left(\sum_{i=1}^m w_i x_i x_i^{\top}\right),$$

where $x_i, \; i = 1,\dots,m$, are given column vectors.
The einsum function may be useful here.

# Your solution is here
In this problem we consider a neural network that classifies a dataset of images. Any neural network can be viewed as a composition of simple linear and non-linear functions. For example, a neural network with 3 layers can be represented as

$$f_3(f_2(f_1(x, w_1), w_2), w_3),$$

where $x$ is the input data (in our case, images) and $w_i, \; i = 1,\dots,3$, are parameters that are going to be trained.
We will study the compression potential of a neural network with a simple architecture: an alternation of linear and non-linear functions.

The main task in this problem is to study how the compression of fully-connected layers affects the test accuracy. A fully-connected layer is represented by the linear function $AX + B$, where $X$ is the input matrix and $A, B$ are trainable matrices. The matrices $A$ in every layer are going to be compressed. The main result you should obtain is a plot of test accuracy versus the total number of parameters in the neural network.
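For orientation (a standard parameter count, not specific to this assignment): if a weight matrix $A \in \mathbb{R}^{p \times q}$ is replaced by its rank-$r$ truncated SVD $A \approx U_r \Sigma_r V_r^{\top}$, the layer stores about $r(p + q + 1)$ numbers instead of $pq$, so the compression reduces the parameter count whenever $r < \frac{pq}{p + q + 1}$.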
import torch
import torchvision
from torchvision import datasets, transforms
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
batch_size = 100
transform = transforms.Compose(
[transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_loader = torch.utils.data.DataLoader(datasets.CIFAR10('./', train=True, download=True, transform=transform),
batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(datasets.CIFAR10('./', train=False, transform=transform),
batch_size=batch_size, shuffle=True)
classes = ('plane', 'car', 'bird', 'cat',
'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
def imshow(img):
img = img / 2 + 0.5 # unnormalize
npimg = img.numpy()
plt.figure(figsize=(20, 10))
plt.imshow(np.transpose(npimg, (1, 2, 0)))
plt.show()
# get some random training images
dataiter = iter(train_loader)
images, labels = next(dataiter)
# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(8)))
For simplicity, and to demonstrate the idea of neural network compression, consider an architecture consisting only of fully-connected layers with ReLU non-linearities between them. To make the compression effect visible, the dimension of the inner layers is set to 1000.
Below you see an implementation of such a neural network in PyTorch. You will study neural networks in more detail in the Deep Learning course in one of the upcoming terms.
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.fc1 = nn.Linear(3 * 32 * 32, 1000)
self.fc2 = nn.Linear(1000, 1000)
self.fc3 = nn.Linear(1000, 1000)
self.fc4 = nn.Linear(1000, 1000)
self.fc5 = nn.Linear(1000, 1000)
self.fc6 = nn.Linear(1000, 10)
self.ReLU = nn.ReLU()
def forward(self, x):
x = self.fc1(x.view(-1, 3 * 32*32))
x = self.ReLU(x)
x = self.fc2(x)
x = self.ReLU(x)
x = self.fc3(x)
x = self.ReLU(x)
x = self.fc4(x)
x = self.ReLU(x)
x = self.fc5(x)
x = self.ReLU(x)
x = self.fc6(x)
return F.log_softmax(x, dim=1)
def train(model, train_loader, optimizer, epoch):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()
optimizer.step()
if batch_idx % log_interval == 0:
print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
epoch, batch_idx * len(data), len(train_loader.dataset),
100. * batch_idx / len(train_loader), loss.item()))
def test(model, test_loader):
model.eval()
test_loss = 0
correct = 0
with torch.no_grad():
for data, target in test_loader:
# data, target = data.to(device), target.to(device)
output = model(data)
test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss
pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability
correct += pred.eq(target.view_as(pred)).sum().item()
test_loss /= len(test_loader.dataset)
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
test_loss, correct, len(test_loader.dataset),
100. * correct / len(test_loader.dataset)))
log_interval = 50
epochs = 7
model = Net()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(1, epochs + 1):
train(model, train_loader, optimizer, epoch)
test(model, test_loader)
Now we have a trained neural network and we are ready to compress the weights in the fully-connected layers.
Implement a new model class that is similar to Net, but with some significant distinctions:

- It takes as input an instance of the class Net and a compression rank $r > 0$.
- It has to compress all matrices $A$ in the fully-connected layers with the SVD, using the first $r$ singular vectors and singular values.
- Pay attention to storing the compressed representation of the layers efficiently.
- The forward method of your new model has to use the compressed representation of the fully-connected layers. In all other aspects it has to reproduce the forward method of the original non-compressed model (number of layers, activations, loss function, etc.).

# Your solution is here
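A minimal sketch of such a compressed model (an illustration under the assumptions above; it relies on torch, nn and F imported earlier, on the layer names fc1-fc6 of the Net class, and keeps two thin factors per layer):

class CompressedNet(nn.Module):
    def __init__(self, net, r):
        super(CompressedNet, self).__init__()
        self.factors = nn.ParameterList()
        self.biases = nn.ParameterList()
        for layer in [net.fc1, net.fc2, net.fc3, net.fc4, net.fc5, net.fc6]:
            W = layer.weight.data                                  # shape (out, in)
            U, S, Vh = torch.linalg.svd(W, full_matrices=False)
            k = min(r, S.shape[0])
            # keep two thin factors of the rank-k approximation instead of the full matrix
            self.factors.append(nn.Parameter(U[:, :k] * S[:k]))    # (out, k)
            self.factors.append(nn.Parameter(Vh[:k, :]))           # (k, in)
            self.biases.append(nn.Parameter(layer.bias.data.clone()))
    def forward(self, x):
        x = x.view(-1, 3 * 32 * 32)
        n_layers = len(self.biases)
        for i in range(n_layers):
            US, Vh, b = self.factors[2 * i], self.factors[2 * i + 1], self.biases[i]
            # x @ (US @ Vh).T + b, computed without ever forming the full weight matrix
            x = (x @ Vh.t()) @ US.t() + b
            if i < n_layers - 1:                                   # ReLU after all but the last layer
                x = F.relu(x)
        return F.log_softmax(x, dim=1)
# Example usage (r is the rank to experiment with):
# compressed_model = CompressedNet(model, r=50)
# test(compressed_model, test_loader)
# n_params = sum(p.numel() for p in compressed_model.parameters())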
A norm is called absolute if $\|x\| = \|\lvert x \rvert\|$ holds for any vector $x$, where $x = (x_1,\dots,x_n)^T$ and $\lvert x \rvert = (\lvert x_1 \rvert,\dots, \lvert x_n \rvert)^T$. Give an example of a norm which is not absolute.
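One family worth checking (an illustrative hint, not part of the original statement): norms of the form $\|x\| := \|Bx\|_2$ with an invertible, non-diagonal $B$. For example, with $B = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$ and $x = (1, -1)^T$ one gets $\|x\| = 1$, while $\|\lvert x \rvert\| = \sqrt{5}$.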
Write a function ranks_HOSVD(A, eps) that calculates the Tucker ranks of a $d$-dimensional tensor $A$ using the Higher-Order SVD (HOSVD) algorithm, where eps is the relative accuracy in the Frobenius norm between the approximated and the initial tensors. Details can be found here on Figure 4.3.
def ranks_HOSVD(A, eps):
return r #r should be a tuple of ranks r = (r1, r2, ..., rd)
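A possible sketch under one common truncation strategy, splitting the error budget $\varepsilon^2 \|A\|_F^2$ evenly over the $d$ modes (other splits of the budget are equally valid):

import numpy as np

def ranks_HOSVD(A, eps):
    d = A.ndim
    budget = (eps ** 2) * (np.linalg.norm(A) ** 2) / d          # per-mode squared-error budget
    ranks = []
    for mode in range(d):
        # mode-k unfolding: put the current mode first and flatten the remaining ones
        unfolding = np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)
        s = np.linalg.svd(unfolding, compute_uv=False)
        # tail[r] = sum of squared singular values discarded if the first r are kept
        tail = np.concatenate([np.cumsum((s ** 2)[::-1])[::-1], [0.0]])
        r = int(np.argmax(tail <= budget))                       # smallest r meeting the budget
        ranks.append(max(r, 1))
    return tuple(ranks)                                          # r = (r1, r2, ..., rd)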