We provide the signatures of the functions that you have to implement. Make sure you follow the defined signatures; otherwise your coding solutions will not be graded.
Many daily-life events can be described using time series $x_t$, $t \in \{0, \ldots, n\}$, e.g. the closing price of a stock on day $t$ or the number of users on a website in hour $t$.
Below you are given an example time series $x \in \mathbb{R}^n$:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('https://nla.skoltech.ru/homeworks/files/data.csv')
x = data['Val'].values
plt.plot(x)
Raw time-dependent data usually looks rather messy. However, it can often be viewed as a sum of a trend and some random noise. For example, a periodic trend is easily observed in the series above.
1. Your first task is to find a truly periodic series that best approximates the original data $x$.
Let $z \in \mathbb{R}^m$ denote one period of some given periodic data $x_{per} \in \mathbb{R}^n$ (assuming $n$ is a multiple of $m$), i.e.:
$$x_{per} = (z,z,\ldots,z),$$
or in matrix form:
$$Az = x_{per},$$
with a "tall" matrix $A \in \mathbb{R}^{n\times m}$, $n > m$.
The actual data $x$ is non-periodic, hence a solution (in an appropriate sense) of the following overdetermined linear system is required (think why):
$$Az = x.$$
Construct the matrix $A$ in the function below; the numpy.vstack() function may be useful here. Describe how you choose $m$. Also, suggest what kind of data you are working with.
def build_A(n, m):
'''
Input: positive integers n, m
Output: np.array of size (n, m)
'''
# Pure loops are strongly prohibited!
# Comprehension expressions are possible.
raise NotImplementedError()
m = ...  # set the period length m here (justify your choice above)
A = build_A(x.shape[0], m)
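For reference, a minimal sketch of one possible build_A (the name build_A_sketch is ours, not part of the required API), assuming $n$ is a multiple of $m$: it stacks identity blocks with numpy.vstack so that $Az = (z, z, \ldots, z)$.
import numpy as np
def build_A_sketch(n, m):
    # stack n // m copies of the m x m identity matrix vertically,
    # so that build_A_sketch(n, m) @ z equals np.tile(z, n // m)
    return np.vstack([np.eye(m)] * (n // m))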
def build_pseudoinverse(A):
'''
Input: np.array of size (n, m)
Output: np.array of size (m, n)
'''
raise NotImplementedError()
A_inv = build_pseudoinverse(A)
z = A_inv @ x
# Place for your plots
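As a hedged sketch (not necessarily the intended implementation), the Moore-Penrose pseudoinverse of a tall matrix with full column rank can be formed from the thin SVD, $A^{+} = V \Sigma^{-1} U^T$; np.linalg.pinv computes the same thing with rank truncation.
import numpy as np
def build_pseudoinverse_sketch(A):
    # thin SVD: A = U @ np.diag(s) @ Vt, with U of size (n, m) and s of size (m,)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # assumes A has full column rank, so all singular values are nonzero
    return Vt.T @ np.diag(1.0 / s) @ U.T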
If your solution is correct, you will see that the obtained periodic approximation is non-smooth.
To ensure smoothness of the periodic approximation, consider the following regularized least-squares problem:
$$L(z) = \|Az - x\|^2 + \lambda R(z) \to \min_z,$$
where $R(z)$ is some regularization term and $\lambda > 0$. The term $R(z)$ reflects our additional assumptions on the smoothness of the periodic approximation $z$. One possible form of this term is
$$ R(z) = \sum_{i=1}^m (z_{i+1} - z_i)^2,$$
where $z_{m+1}$ is understood as $z_1$, since $z$ represents one period of a periodic series. In matrix form the same regularization term reads
$$R(z) = \|Dz\|_2^2.$$
Intuitively, the regularization term penalizes "sharp" series, and the parameter $\lambda$ provides a trade-off between approximation quality (the first term) and the degree of smoothness (the second term). Note that the case $\lambda = 0$ corresponds to the previously obtained solution $z^*$ (why?).
2. Your second task is to find the best smooth periodic approximation $z^*_s$ of the original data $x$.
def build_D(m):
'''
Input: positive integer m
Output: np.array of size (m, m)
'''
raise NotImplementedError()
D = build_D(z.shape[0])
# Your derivation is here!
lmbda = 1. # You are supposed to play with parameter $\lambda$ during your study
# Construct system
# z_s = np.linalg.solve(matrix, rhs)
# Your solution is here
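If it helps, here is one possible sketch (the name build_D_sketch is ours) using the convention $z_{m+1} = z_1$ from above: $D$ is a forward-difference matrix with wrap-around, and setting the gradient of $L(z)$ to zero leads to the normal equations $(A^T A + \lambda D^T D)\, z = A^T x$, which is essentially what your derivation should produce.
import numpy as np
def build_D_sketch(m):
    # (D z)_i = z_{i+1} - z_i, with the wrap-around term z_{m+1} = z_1
    return np.roll(np.eye(m), 1, axis=1) - np.eye(m)
# D = build_D_sketch(z.shape[0])
# matrix = A.T @ A + lmbda * D.T @ D   # normal equations of the regularized problem
# rhs = A.T @ x
# z_s = np.linalg.solve(matrix, rhs)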
3. The final step is to build a model for predicting time series.
- We want to predict $x_{k+1}$ based on $M$ previous observations.
- Consider the following (linear) auto-regression (AR) model:
$$\hat{x}_{k+1} = \theta_1 x_k + \theta_2 x_{k-1} + \ldots + \theta_M x_{k-M+1},$$
with $M$ parameters $\theta_1, \ldots, \theta_M$ to be determined.
- We fit the parameters by minimizing the squared norm of the residual (again!):
$$(\hat{x}_{M+1} - x_{M+1})^2 + \ldots + (\hat{x}_n - x_n)^2 \to \min_{\theta}.$$
- This problem can be reformulated in a familiar form:
$$\|X\theta - x\|_2^2 \to \min_{\theta}.$$
Implement the function below; you may use any functions from the NumPy/SciPy stack.
def build_X(x, M):
'''
Input: np.array of size n, positive integer M
Output: np.array of size (n - M, M)
'''
raise NotImplementedError()
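One possible sketch of build_X (check the ordering against your own convention): row $i$ collects the $M$ observations preceding x[M + i], most recent first, so that X @ theta matches the AR model above with targets x[M:].
import numpy as np
def build_X_sketch(x, M):
    n = x.shape[0]
    # row i is [x[i+M-1], x[i+M-2], ..., x[i]] -- the M values preceding x[i+M]
    return np.vstack([x[i:i + M][::-1] for i in range(n - M)])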
Note that the previously obtained smooth periodic approximation $z_s$ can itself be regarded as a predictive model. However, we also want to include the noise term in the model by fitting it with the AR model:
x_n = x - A @ z_s # Extract noise
train_x = x_n[:x_n.shape[0] // 2] # Train data to fit the model
test_x = x_n[x_n.shape[0] // 2:] # Test data to validate the model
M = 3 # You are supposed to find the best parameter $M$ during your study
X_train = build_X(train_x, M)
Build the pseudoinverse via the QR decomposition, using the numpy.linalg.qr() function.
def build_pseudoinverse_QR(A):
'''
Input: np.array of size (n - M, M)
Output: np.array of size (M, n - M)
'''
raise NotImplementedError()
X_inv = build_pseudoinverse_QR(X_train)
theta = X_inv @ train_x[M:]
test_X = build_X(test_x, M)
predicted = test_X @ theta
# Plotting
# Your solution is here
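A minimal sketch of a QR-based pseudoinverse, assuming the input has full column rank: from the thin QR factorization $A = QR$ one gets $A^{+} = R^{-1} Q^T$.
import numpy as np
def build_pseudoinverse_QR_sketch(A):
    Q, R = np.linalg.qr(A, mode='reduced')  # thin QR: A = Q @ R, R upper triangular
    return np.linalg.solve(R, Q.T)          # A^+ = R^{-1} Q^T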
1.
2.
# Your solution is here
1. (5 pts) Consider the following function
$$ F(U, V) = \frac{1}{2}\|X - UV\|_F^2, $$
where $X \in \mathbb{R}^{n \times n}$, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times n}$ and $k < n$.
2. (2 pts) Derive an analytical expression for the gradient of the function $R$:
$$ R(x) = \frac{(Ax, x)}{(x, x)}, $$
where $A$ is a symmetric real matrix. You will learn later in the lectures why the gradient of this function is important in NLA.
3. (8 pts) Consider the following function $f$
$$f(w) = \log\det\left(\sum_{i=1}^m w_i x_i x_i^{\top}\right),$$
where $x_i, \; i = 1,\dots,m$ are given column vectors.
Hint: the einsum function may be useful here.
# Your solution is here
In this problem we consider a neural network that classifies a dataset of images. Any neural network can be viewed as a composition of simple linear and non-linear functions. For example, a neural network with 3 layers can be represented as
$$f_3(f_2(f_1(x, w_1), w_2), w_3),$$
where $x$ is the input data (in our case, images) and $w_i, \; i = 1,\dots,3$ are parameters that are going to be trained.
We will study the compression potential of a neural network with a simple architecture: an alternation of several linear and non-linear functions.
The main task in this problem is to study how the compression of fully-connected layers affects the test accuracy. Any fully-connected layer is represented as a linear function $AX + B$, where $X$ is the input matrix and $A, B$ are trainable matrices. The matrices $A$ in every layer are going to be compressed. The main result you should obtain is a plot of the dependence of test accuracy on the total number of parameters in the neural network.
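To make the parameter counting concrete, here is a rough illustration (the helper names are ours, not part of the assignment API): an uncompressed layer with weight $A \in \mathbb{R}^{n_{out} \times n_{in}}$ and a bias stores $n_{out} n_{in} + n_{out}$ numbers, while a rank-$r$ factorization $A \approx UV$ stores $r(n_{out} + n_{in}) + n_{out}$.
def dense_params(n_in, n_out):
    # weight matrix plus bias vector
    return n_out * n_in + n_out
def lowrank_params(n_in, n_out, r):
    # U of size (n_out, r), V of size (r, n_in), plus bias
    return n_out * r + r * n_in + n_out
print(dense_params(1000, 1000), lowrank_params(1000, 1000, 50))  # 1001000 vs 101000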
import torch
import torchvision
import torchvision.transforms as transforms
from torchvision import datasets
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
batch_size = 100
transform = transforms.Compose(
[transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_loader = torch.utils.data.DataLoader(datasets.CIFAR10('./', train=True, download=True, transform=transform),
batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(datasets.CIFAR10('./', train=False, transform=transform),
batch_size=batch_size, shuffle=True)
classes = ('plane', 'car', 'bird', 'cat',
'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
def imshow(img):
img = img / 2 + 0.5 # unnormalize
npimg = img.numpy()
plt.figure(figsize=(20, 10))
plt.imshow(np.transpose(npimg, (1, 2, 0)))
plt.show()
# get some random training images
dataiter = iter(train_loader)
images, labels = next(dataiter)
# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(8)))
dog horse truck plane plane horse horse dog
For simplicity, and to demonstrate the idea of neural network compression, consider an architecture consisting only of fully-connected layers with non-linear ReLU functions between them. To make the compression effect visible, set the dimension of the inner layers to 1000.
Below you see an implementation of such a neural network in PyTorch. You will study neural networks in more detail in the Deep Learning course in one of the upcoming terms.
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.fc1 = nn.Linear(3 * 32 * 32, 1000)
self.fc2 = nn.Linear(1000, 1000)
self.fc3 = nn.Linear(1000, 1000)
self.fc4 = nn.Linear(1000, 1000)
self.fc5 = nn.Linear(1000, 1000)
self.fc6 = nn.Linear(1000, 10)
self.ReLU = nn.ReLU()
def forward(self, x):
x = self.fc1(x.view(-1, 3 * 32*32))
x = self.ReLU(x)
x = self.fc2(x)
x = self.ReLU(x)
x = self.fc3(x)
x = self.ReLU(x)
x = self.fc4(x)
x = self.ReLU(x)
x = self.fc5(x)
x = self.ReLU(x)
x = self.fc6(x)
return F.log_softmax(x, dim=1)
def train(model, train_loader, optimizer, epoch):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()
optimizer.step()
if batch_idx % log_interval == 0:
print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
epoch, batch_idx * len(data), len(train_loader.dataset),
100. * batch_idx / len(train_loader), loss.item()))
def test(model, test_loader):
model.eval()
test_loss = 0
correct = 0
with torch.no_grad():
for data, target in test_loader:
# data, target = data.to(device), target.to(device)
output = model(data)
test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss
pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability
correct += pred.eq(target.view_as(pred)).sum().item()
test_loss /= len(test_loader.dataset)
print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
test_loss, correct, len(test_loader.dataset),
100. * correct / len(test_loader.dataset)))
log_interval = 50
epochs = 7
model = Net()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(1, epochs + 1):
train(model, train_loader, optimizer, epoch)
test(model, test_loader)
Train Epoch: 1 [0/50000 (0%)] Loss: 2.303509 Train Epoch: 1 [5000/50000 (10%)] Loss: 2.017663 Train Epoch: 1 [10000/50000 (20%)] Loss: 2.012401 Train Epoch: 1 [15000/50000 (30%)] Loss: 1.880147 Train Epoch: 1 [20000/50000 (40%)] Loss: 1.677673 Train Epoch: 1 [25000/50000 (50%)] Loss: 1.762426 Train Epoch: 1 [30000/50000 (60%)] Loss: 1.699280 Train Epoch: 1 [35000/50000 (70%)] Loss: 1.811280 Train Epoch: 1 [40000/50000 (80%)] Loss: 1.371416 Train Epoch: 1 [45000/50000 (90%)] Loss: 1.669703 Test set: Average loss: 1.5764, Accuracy: 4378/10000 (44%) Train Epoch: 2 [0/50000 (0%)] Loss: 1.346880 Train Epoch: 2 [5000/50000 (10%)] Loss: 1.285201 Train Epoch: 2 [10000/50000 (20%)] Loss: 1.515415 Train Epoch: 2 [15000/50000 (30%)] Loss: 1.770635 Train Epoch: 2 [20000/50000 (40%)] Loss: 1.582595 Train Epoch: 2 [25000/50000 (50%)] Loss: 1.431566 Train Epoch: 2 [30000/50000 (60%)] Loss: 1.358462 Train Epoch: 2 [35000/50000 (70%)] Loss: 1.410893 Train Epoch: 2 [40000/50000 (80%)] Loss: 1.661417 Train Epoch: 2 [45000/50000 (90%)] Loss: 1.529408 Test set: Average loss: 1.4822, Accuracy: 4786/10000 (48%) Train Epoch: 3 [0/50000 (0%)] Loss: 1.470752 Train Epoch: 3 [5000/50000 (10%)] Loss: 1.197014 Train Epoch: 3 [10000/50000 (20%)] Loss: 1.482764 Train Epoch: 3 [15000/50000 (30%)] Loss: 1.230034 Train Epoch: 3 [20000/50000 (40%)] Loss: 1.359152 Train Epoch: 3 [25000/50000 (50%)] Loss: 1.512282 Train Epoch: 3 [30000/50000 (60%)] Loss: 1.559847 Train Epoch: 3 [35000/50000 (70%)] Loss: 1.481677 Train Epoch: 3 [40000/50000 (80%)] Loss: 1.335407 Train Epoch: 3 [45000/50000 (90%)] Loss: 1.456683 Test set: Average loss: 1.4086, Accuracy: 5091/10000 (51%) Train Epoch: 4 [0/50000 (0%)] Loss: 1.316813 Train Epoch: 4 [5000/50000 (10%)] Loss: 1.349132 Train Epoch: 4 [10000/50000 (20%)] Loss: 1.344821 Train Epoch: 4 [15000/50000 (30%)] Loss: 1.472614 Train Epoch: 4 [20000/50000 (40%)] Loss: 1.342841 Train Epoch: 4 [25000/50000 (50%)] Loss: 1.232694 Train Epoch: 4 [30000/50000 (60%)] Loss: 1.187062 Train Epoch: 4 [35000/50000 (70%)] Loss: 1.362742 Train Epoch: 4 [40000/50000 (80%)] Loss: 1.120951 Train Epoch: 4 [45000/50000 (90%)] Loss: 1.142083 Test set: Average loss: 1.4046, Accuracy: 5173/10000 (52%) Train Epoch: 5 [0/50000 (0%)] Loss: 1.248556 Train Epoch: 5 [5000/50000 (10%)] Loss: 1.106759 Train Epoch: 5 [10000/50000 (20%)] Loss: 1.129976 Train Epoch: 5 [15000/50000 (30%)] Loss: 1.121431 Train Epoch: 5 [20000/50000 (40%)] Loss: 1.309171 Train Epoch: 5 [25000/50000 (50%)] Loss: 1.135936 Train Epoch: 5 [30000/50000 (60%)] Loss: 1.333919 Train Epoch: 5 [35000/50000 (70%)] Loss: 1.103451 Train Epoch: 5 [40000/50000 (80%)] Loss: 1.062910 Train Epoch: 5 [45000/50000 (90%)] Loss: 1.201752 Test set: Average loss: 1.3646, Accuracy: 5323/10000 (53%) Train Epoch: 6 [0/50000 (0%)] Loss: 1.094037 Train Epoch: 6 [5000/50000 (10%)] Loss: 0.929148 Train Epoch: 6 [10000/50000 (20%)] Loss: 1.146780 Train Epoch: 6 [15000/50000 (30%)] Loss: 1.208604 Train Epoch: 6 [20000/50000 (40%)] Loss: 1.335437 Train Epoch: 6 [25000/50000 (50%)] Loss: 1.109769 Train Epoch: 6 [30000/50000 (60%)] Loss: 1.153895 Train Epoch: 6 [35000/50000 (70%)] Loss: 1.133219 Train Epoch: 6 [40000/50000 (80%)] Loss: 1.035433 Train Epoch: 6 [45000/50000 (90%)] Loss: 1.206637 Test set: Average loss: 1.3586, Accuracy: 5422/10000 (54%) Train Epoch: 7 [0/50000 (0%)] Loss: 0.927365 Train Epoch: 7 [5000/50000 (10%)] Loss: 1.119912 Train Epoch: 7 [10000/50000 (20%)] Loss: 0.962684 Train Epoch: 7 [15000/50000 (30%)] Loss: 1.104071 Train Epoch: 7 [20000/50000 (40%)] Loss: 
1.173813 Train Epoch: 7 [25000/50000 (50%)] Loss: 0.904804 Train Epoch: 7 [30000/50000 (60%)] Loss: 1.081464 Train Epoch: 7 [35000/50000 (70%)] Loss: 0.863629 Train Epoch: 7 [40000/50000 (80%)] Loss: 1.030450 Train Epoch: 7 [45000/50000 (90%)] Loss: 1.005699 Test set: Average loss: 1.3439, Accuracy: 5435/10000 (54%)
Now that the neural network is trained, we are ready to compress the weights in its fully-connected layers.
Implement a new model class that is similar to the class Net, but with some significant distinctions.
- It takes as input parameters an instance of the class Net and a compression rank $r > 0$.
- It has to compress all matrices $A$ in the fully-connected layers with the SVD, using the first $r$ singular vectors and singular values.
- Pay attention to efficient storage of the compressed representation of the layers.
- The forward method of your new model has to use the compressed representation of the fully-connected layers. In all other aspects it has to reproduce the forward method of the original non-compressed model (number of layers, activations, loss function etc.).
# Your solution is here
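As a hint for the building block (a sketch only, not the required class), a single fully-connected weight can be truncated to rank $r$ with the SVD in PyTorch and stored as two small factors; your class should do this for every fc layer of the trained Net and use the factors in forward(). The sketch assumes a PyTorch version that provides torch.linalg.svd (older versions expose torch.svd instead).
import torch
def compress_weight(A, r):
    # A is the (out_features, in_features) weight of an nn.Linear layer
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)  # A = U @ diag(S) @ Vh
    U_r = U[:, :r] * S[:r]   # absorb the singular values into the left factor
    V_r = Vh[:r, :]
    return U_r, V_r          # A is approximated by U_r @ V_r, stored with r * (out + in) numbers
# inside forward(): nn.Linear computes x @ A.T + b, so the compressed layer is
# y = (x @ V_r.T) @ U_r.T + b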
A norm is called absolute if $\|x\| = \|\, \lvert x \rvert \,\|$ holds for any vector $x$, where $x = (x_1,\dots,x_n)^T$ and $\lvert x \rvert = (\lvert x_1 \rvert,\dots, \lvert x_n \rvert)^T$. Give an example of a norm which is not absolute.
Write a function ranks_HOSVD(A, eps) that calculates the Tucker ranks of a $d$-dimensional tensor $A$ using the Higher-Order SVD (HOSVD) algorithm, where eps is the relative accuracy in the Frobenius norm between the approximated and the initial tensors. Details can be found here in Figure 4.3.
def ranks_HOSVD(A, eps):
return r #r should be a tuple of ranks r = (r1, r2, ..., rd)
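For orientation, a hedged sketch of how the ranks could be chosen (splitting the error budget as eps/sqrt(d) per mode is one common convention; check it against the algorithm in Figure 4.3).
import numpy as np
def ranks_HOSVD_sketch(A, eps):
    d = A.ndim
    delta2 = (eps * np.linalg.norm(A) / np.sqrt(d)) ** 2  # per-mode squared error budget
    ranks = []
    for k in range(d):
        Ak = np.moveaxis(A, k, 0).reshape(A.shape[k], -1)  # mode-k unfolding
        s = np.linalg.svd(Ak, compute_uv=False)
        # tail[j] = sum of squared singular values discarded if the first j are kept
        tail = np.concatenate([np.cumsum((s ** 2)[::-1])[::-1], [0.0]])
        ranks.append(int(np.argmax(tail <= delta2)))  # smallest j within the budget
    return tuple(ranks)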