We provide the signatures of the functions that you have to implement. Make sure you follow them; otherwise, your coding solutions will not be graded.
Please submit a single Jupyter Notebook file in which only Python and Markdown/$\LaTeX$ are used. Hand-written solutions inserted as photos or in any other way are prohibited and will not be graded. If you have any questions about using Markdown, ask them!
1.
2.
Hint: $\exp(S^{-1}AS) = \sum\limits_{k=0}^{\infty}\frac{(S^{-1}AS)^k}{k!} = S^{-1}\exp(A)S$; moreover, $\exp(D) = \mathrm{diag}(\exp(d_1), \exp(d_2), \ldots, \exp(d_n))$, where $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$ is a diagonal matrix with diagonal entries $d_1, d_2, \ldots, d_n$. (A quick numerical check of this identity is sketched right after this list.)
3.
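As an aside, the identity from the hint is easy to verify numerically. Below is a quick illustrative check, assuming SciPy is available for the matrix exponential; it is not part of the required solution.

# Quick numerical check of exp(S^{-1} A S) = S^{-1} exp(A) S (illustrative only; assumes SciPy)
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
S = rng.standard_normal((n, n))      # a random S is invertible with probability 1
S_inv = np.linalg.inv(S)

lhs = expm(S_inv @ A @ S)            # exp(S^{-1} A S)
rhs = S_inv @ expm(A) @ S            # S^{-1} exp(A) S
print(np.allclose(lhs, rhs))         # expected: True (up to round-off)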
# Your solution is here
1. (11 pts) Consider the following function
$$ F(U, V) = \frac{1}{2}\|X - UDV\|_F^2, $$
where $X \in \mathbb{R}^{n \times n}$, $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{k \times n}$, $k < n$, and $D = \mathrm{diag}(d_1, \ldots, d_k)$ is a given diagonal matrix.
2. (4 pts) Derive an analytical expression for the gradient and the Hessian of the function $R$:
$$ R(x) = \frac{(Ax, x)}{(x, x)}, $$
where $A$ is a symmetric real matrix. You will learn later in the lectures why the gradient of this function is important in NLA.
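Whatever closed-form expressions you derive above, they are easy to sanity-check numerically. Below is a minimal finite-difference gradient check on the Rayleigh quotient; it is illustrative only, and `numerical_grad` is an ad hoc helper, not part of the assignment.

# Minimal sketch: compare an analytical gradient with central finite differences (ad hoc helper)
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                          # symmetric matrix, as in the Rayleigh quotient task
R = lambda x: (A @ x) @ x / (x @ x)

x0 = rng.standard_normal(n)
print(numerical_grad(R, x0))               # compare with your derived gradient evaluated at x0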
# Your solution is here
In this problem we consider a neural network that classifies a dataset of images. Any neural network can be viewed as a composition of simple linear and non-linear functions. For example, a neural network with 3 layers can be represented as
$$f_3(f_2(f_1(x, w_1), w_2), w_3),$$
where $x$ is the input data (in our case, images) and $w_i, \; i = 1,\dots,3$ are the parameters to be trained.

We will study the compression potential of a neural network with a simple architecture: an alternation of a number of linear and non-linear functions.

The main task in this problem is to study how the compression of fully-connected layers affects the test accuracy. Any fully-connected layer is represented as a linear function $AX + B$, where $X$ is the input matrix and $A, B$ are trainable matrices. The matrices $A$ in every layer are going to be compressed. The main result you should obtain is a plot of the test accuracy versus the total number of parameters in the neural network; a small sketch of the underlying rank-$r$ compression and its parameter count is given right below.
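To make the compression idea concrete, here is a quick illustrative sketch (with ad hoc names; not the required implementation) of replacing a dense weight matrix $A$ by its best rank-$r$ approximation from the truncated SVD, and of the resulting parameter count.

# Illustrative sketch: rank-r truncated SVD of a dense weight matrix (ad hoc example)
import numpy as np

n, m, r = 1000, 1000, 50                       # layer sizes and an example compression rank
A = np.random.randn(n, m)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]    # keep the first r singular vectors/values
A_r = U_r @ np.diag(s_r) @ Vt_r                # best rank-r approximation of A (Eckart--Young)

# Parameter count: n*m numbers for the dense matrix vs. r*(n + m + 1) for the stored factors
print(n * m, r * (n + m + 1))

# Efficient application: never form A_r explicitly, apply the factors one by one
x = np.random.randn(m)
y = U_r @ (s_r * (Vt_r @ x))                   # equals A_r @ x up to round-off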
import torch
import torchvision
import torchvision.transforms as transforms
from torchvision import datasets
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
batch_size = 100
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

train_loader = torch.utils.data.DataLoader(datasets.CIFAR10('./', train=True, download=True, transform=transform),
                                           batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(datasets.CIFAR10('./', train=False, transform=transform),
                                          batch_size=batch_size, shuffle=True)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
Files already downloaded and verified
def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.figure(figsize=(20, 10))
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()

# get some random training images
dataiter = iter(train_loader)
images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(8)))
frog plane dog cat frog deer dog cat
For simplicity, and to demonstrate the idea of neural network compression, consider an architecture consisting only of fully-connected layers with non-linear ReLU functions between them. To make the compression effect visible, set the dimension of the inner layers to 1000.
Below you see an implementation of such a neural network in PyTorch. You will study neural networks in more detail in the Deep Learning course in one of the upcoming terms.
class ResidualBlock(nn.Module):
    def __init__(self, in_features, middle_features, activation=nn.ReLU()):
        super().__init__()
        self.layer1 = nn.Linear(in_features, middle_features)
        self.layer2 = nn.Linear(middle_features, in_features)
        self.activation = activation

    def forward(self, x):
        out = self.layer1(x)
        out = self.activation(out)
        out = self.layer2(out)
        out = out + x
        out = self.activation(out)
        return out
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(3 * 32 * 32, 1000)
        self.res1 = ResidualBlock(1000, 1000)
        self.res2 = ResidualBlock(1000, 1000)
        self.fc2 = nn.Linear(1000, 10)
        self.ReLU = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x.view(-1, 3 * 32 * 32))
        x = self.ReLU(x)
        x = self.res1(x)
        x = self.res2(x)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
def train(model, train_loader, optimizer, epoch):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.train()
    model.to(device)
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
def test(model, test_loader):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.eval()
    model.to(device)   # make sure the model and the data are on the same device
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
log_interval = 50
epochs = 7

model = Net()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(1, epochs + 1):
    train(model, train_loader, optimizer, epoch)
    test(model, test_loader)
Train Epoch: 1 [0/50000 (0%)] Loss: 2.324131 Train Epoch: 1 [5000/50000 (10%)] Loss: 1.751483 Train Epoch: 1 [10000/50000 (20%)] Loss: 1.567799 Train Epoch: 1 [15000/50000 (30%)] Loss: 1.660581 Train Epoch: 1 [20000/50000 (40%)] Loss: 1.531731 Train Epoch: 1 [25000/50000 (50%)] Loss: 1.592496 Train Epoch: 1 [30000/50000 (60%)] Loss: 1.671944 Train Epoch: 1 [35000/50000 (70%)] Loss: 1.381795 Train Epoch: 1 [40000/50000 (80%)] Loss: 1.463287 Train Epoch: 1 [45000/50000 (90%)] Loss: 1.673219
Test set: Average loss: 1.4851, Accuracy: 4772/10000 (48%)
Train Epoch: 2 [0/50000 (0%)] Loss: 1.170636 Train Epoch: 2 [5000/50000 (10%)] Loss: 1.359391 Train Epoch: 2 [10000/50000 (20%)] Loss: 1.563428 Train Epoch: 2 [15000/50000 (30%)] Loss: 1.337223 Train Epoch: 2 [20000/50000 (40%)] Loss: 1.377386 Train Epoch: 2 [25000/50000 (50%)] Loss: 1.385065 Train Epoch: 2 [30000/50000 (60%)] Loss: 1.359587 Train Epoch: 2 [35000/50000 (70%)] Loss: 1.294609 Train Epoch: 2 [40000/50000 (80%)] Loss: 1.692260 Train Epoch: 2 [45000/50000 (90%)] Loss: 1.325166
Test set: Average loss: 1.4136, Accuracy: 5059/10000 (51%)
Train Epoch: 3 [0/50000 (0%)] Loss: 1.359571 Train Epoch: 3 [5000/50000 (10%)] Loss: 1.244511 Train Epoch: 3 [10000/50000 (20%)] Loss: 1.127326 Train Epoch: 3 [15000/50000 (30%)] Loss: 1.145635 Train Epoch: 3 [20000/50000 (40%)] Loss: 1.234360 Train Epoch: 3 [25000/50000 (50%)] Loss: 1.113500 Train Epoch: 3 [30000/50000 (60%)] Loss: 1.355322 Train Epoch: 3 [35000/50000 (70%)] Loss: 1.549211 Train Epoch: 3 [40000/50000 (80%)] Loss: 1.377020 Train Epoch: 3 [45000/50000 (90%)] Loss: 1.103844
Test set: Average loss: 1.3749, Accuracy: 5239/10000 (52%)
Train Epoch: 4 [0/50000 (0%)] Loss: 1.193062 Train Epoch: 4 [5000/50000 (10%)] Loss: 1.273741 Train Epoch: 4 [10000/50000 (20%)] Loss: 1.404070 Train Epoch: 4 [15000/50000 (30%)] Loss: 1.080328 Train Epoch: 4 [20000/50000 (40%)] Loss: 1.224075 Train Epoch: 4 [25000/50000 (50%)] Loss: 1.333977 Train Epoch: 4 [30000/50000 (60%)] Loss: 1.072122 Train Epoch: 4 [35000/50000 (70%)] Loss: 1.021575 Train Epoch: 4 [40000/50000 (80%)] Loss: 1.187461 Train Epoch: 4 [45000/50000 (90%)] Loss: 1.171425
Test set: Average loss: 1.3591, Accuracy: 5292/10000 (53%)
Train Epoch: 5 [0/50000 (0%)] Loss: 1.168614 Train Epoch: 5 [5000/50000 (10%)] Loss: 1.028332 Train Epoch: 5 [10000/50000 (20%)] Loss: 0.973166 Train Epoch: 5 [15000/50000 (30%)] Loss: 1.170292 Train Epoch: 5 [20000/50000 (40%)] Loss: 1.056305 Train Epoch: 5 [25000/50000 (50%)] Loss: 0.999553 Train Epoch: 5 [30000/50000 (60%)] Loss: 0.904085 Train Epoch: 5 [35000/50000 (70%)] Loss: 1.236696 Train Epoch: 5 [40000/50000 (80%)] Loss: 1.060990 Train Epoch: 5 [45000/50000 (90%)] Loss: 1.029037
Test set: Average loss: 1.3808, Accuracy: 5349/10000 (53%)
Train Epoch: 6 [0/50000 (0%)] Loss: 0.827618 Train Epoch: 6 [5000/50000 (10%)] Loss: 0.812197 Train Epoch: 6 [10000/50000 (20%)] Loss: 0.885723 Train Epoch: 6 [15000/50000 (30%)] Loss: 0.901153 Train Epoch: 6 [20000/50000 (40%)] Loss: 0.758302 Train Epoch: 6 [25000/50000 (50%)] Loss: 0.850228 Train Epoch: 6 [30000/50000 (60%)] Loss: 1.227192 Train Epoch: 6 [35000/50000 (70%)] Loss: 0.877823 Train Epoch: 6 [40000/50000 (80%)] Loss: 1.058005 Train Epoch: 6 [45000/50000 (90%)] Loss: 1.079689
Test set: Average loss: 1.4376, Accuracy: 5333/10000 (53%)
Train Epoch: 7 [0/50000 (0%)] Loss: 0.879806 Train Epoch: 7 [5000/50000 (10%)] Loss: 0.626541 Train Epoch: 7 [10000/50000 (20%)] Loss: 0.762323 Train Epoch: 7 [15000/50000 (30%)] Loss: 0.807961 Train Epoch: 7 [20000/50000 (40%)] Loss: 0.981910 Train Epoch: 7 [25000/50000 (50%)] Loss: 1.056234 Train Epoch: 7 [30000/50000 (60%)] Loss: 0.776608 Train Epoch: 7 [35000/50000 (70%)] Loss: 0.904691 Train Epoch: 7 [40000/50000 (80%)] Loss: 1.145846 Train Epoch: 7 [45000/50000 (90%)] Loss: 0.839523
Test set: Average loss: 1.4711, Accuracy: 5333/10000 (53%)
Now we have a (somehow) trained neural network, and we are ready to compress the weights in its fully-connected layers.

1. Implement a compressed counterpart of the original model; it is similar to the class `Net`, but with some significant distinctions. It takes as input parameters an instance of the class `Net` and a compression rank $r > 0$. After that, this model has to compress all matrices $A$ in the fully-connected layers with the SVD, using the first $r$ singular vectors and singular values. Pay attention to storing the compressed representation of the layers efficiently. Also, the `forward` method of your new residual block has to use the compressed representation of the fully-connected layers (inside it) in the most efficient way. In all other aspects it has to reproduce the `forward` method of the original non-compressed model (number of layers, activations, loss function, etc.). A minimal sketch of such a low-rank layer is given after the hints below.
2. (5 pts) Plot the dependence of the test accuracy on the number of parameters in the compressed model. This number of parameters obviously depends on the compression rank $r$. Also plot the dependence of the inference time on the compression rank $r$. Explain the obtained results. To measure time, use %timeit with the necessary parameters (see the lectures for examples of using this command).
3. (5 pts) After such transformations, your model depends on the factors obtained from the SVD. Therefore, these factors can also be trained with the same gradient method for some number of epochs. This procedure is called fine-tuning. We ask you to fine-tune your compressed model for 1 to 5 epochs and compare the resulting test accuracy with the test accuracy you get right after compression. Explain the observed results.
Hint 1. Please check that the factors are indeed updated during fine-tuning. If for some reason they are not updated, check that their parameter `requires_grad` is set to `True`.

Hint 2. You can use the class `torch.nn.Parameter` to convert the factors into parameters of the network.
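For illustration only, here is a minimal sketch (with ad hoc names, assuming a PyTorch version that provides torch.linalg.svd; it is not the required implementation) of a low-rank replacement of nn.Linear that stores the truncated SVD factors as trainable parameters and never forms the full weight matrix.

# Minimal sketch of a low-rank replacement for nn.Linear (ad hoc names, illustrative only)
class LowRankLinear(nn.Module):
    def __init__(self, linear, r):
        # linear: a trained nn.Linear layer; r: compression rank
        super().__init__()
        U, s, Vh = torch.linalg.svd(linear.weight.data, full_matrices=False)
        # store the truncated factors as nn.Parameter, so that fine-tuning can update them
        self.U = nn.Parameter(U[:, :r] * s[:r])        # (out_features, r), singular values absorbed
        self.V = nn.Parameter(Vh[:r, :])               # (r, in_features)
        self.bias = nn.Parameter(linear.bias.data.clone())

    def forward(self, x):
        # apply the factors one after another: O(r(n_in + n_out)) work instead of O(n_in n_out)
        return (x @ self.V.t()) @ self.U.t() + self.bias

Such a layer stores $r(n_\mathrm{in} + n_\mathrm{out}) + n_\mathrm{out}$ numbers instead of $n_\mathrm{in} n_\mathrm{out} + n_\mathrm{out}$, which is what enters the accuracy-versus-parameters plot; inference time can be measured, for example, with %timeit applied to a forward pass on a fixed batch.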
# Your solution is here
Hermione Granger is well versed in all magical disciplines. In particular, she is a great expert in Numerical Linear Algebra and JAX.
She has invited Harry Potter to play a game.

Hermione chooses a number $r \in [1, 95]$ and two matrices $W_1 \in \mathbb{R}^{r \times 100}$ and $W_2 \in \mathbb{R}^{100 \times r}$. Harry can tell her any 100-dimensional vector $x$, and Hermione gives Harry the result of
$$ \|\sin(W_2 \cos(W_1 x))\|^2_2, $$
where the trigonometric functions are applied element-wise. The result is the squared norm of a 100-dimensional vector.

To win, Harry has to guess which number $r$ Hermione has chosen. Harry knows Python, but he is an absolute layman in algebra. Please help him get at least 95% of the answers correct!
Hint 1: SVD might help you, but use it wisely!
Hint 2: Suppose that a special magic allows you to compute gradients through automatic differentiation in JAX. You can also estimate gradients via finite differences if you prefer.
Hint 3: You can estimate the matrix rank using simple heuristics.
You should write your code in the `harry_answers` function. Good luck!
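As a self-contained illustration of Hints 1-3 (on a toy function with ad hoc names, not a ready-made solution for the Game class), one can stack gradients of a black-box scalar function at several random points and count the significant singular values of the resulting matrix:

# Toy illustration of Hints 1-3 (ad hoc names): gradients of g(x) = ||sin(W2 cos(W1 x))||^2
# lie in the row space of W1, so the numerical rank of a matrix of stacked gradients reveals r.
import jax
import jax.numpy as jnp

toy_key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(toy_key, 3)
toy_r = 7
toy_W1 = jax.random.uniform(k1, (toy_r, 100))
toy_W2 = jax.random.uniform(k2, (100, toy_r))

def toy_g(x):
    return jnp.sum(jnp.square(jnp.sin(toy_W2 @ jnp.cos(toy_W1 @ x))))

xs = jax.random.normal(k3, (100, 100))            # random query points (at least as many as the largest possible rank)
grads = jax.vmap(jax.grad(toy_g))(xs)             # (100, 100) matrix of stacked gradients
sigma = jnp.linalg.svd(grads, compute_uv=False)   # singular values in decreasing order
print(int(jnp.sum(sigma > 1e-3 * sigma[0])))      # heuristic numerical rank; expected: toy_r

The relative threshold used here is just one simple heuristic for the numerical rank.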
import jax.numpy as jnp
import jax
import numpy as np
import random
class Game:
    def __init__(self, key):
        # key is a jax.random.PRNGKey
        self.key = key

    def hermione_chooses_r_and_W(self):
        self.r = random.randint(1, 95)
        self.key, subkey = jax.random.split(self.key)
        self.W1 = jax.random.uniform(subkey, (self.r, 100), maxval=100., minval=0., dtype=jnp.float32)
        self.key, subkey = jax.random.split(self.key)
        self.W2 = jax.random.uniform(subkey, (100, self.r), maxval=100., minval=0., dtype=jnp.float32)

    def hermione_computes_function(self, x):
        return jnp.sum(jnp.square(jnp.sin(self.W2 @ jnp.cos(self.W1 @ x))))

    def harry_answers(self):
        # <your code here>
        # you shouldn't use self.r, self.W1, or self.W2
        # you can call `hermione_computes_function` multiple times
        r = 1  # for example
        return r

    def play(self, n_rounds, verbose=True):
        # n_rounds: a number of rounds of the game
        # verbose: print or not the result of each round
        n_right_answers = 0
        for _ in range(n_rounds):
            self.hermione_chooses_r_and_W()
            r = self.harry_answers()
            if r == self.r:
                if verbose:
                    print("Good job! The true answer is {}, Harry's answer is {}!".format(self.r, r))
                n_right_answers += 1
            else:
                if verbose:
                    print("Harry's answer is {}, but the true answer is {} :(".format(r, self.r))
        if float(n_right_answers) / n_rounds > 0.95:
            print('Well done: {}/{} right answers!'.format(n_right_answers, n_rounds))
        else:
            print('Only {}/{} right answers :( Work a little more and you will succeed!'.format(
                n_right_answers, n_rounds))
key = jax.random.PRNGKey(713)
game = Game(key)
game.play(n_rounds=100, verbose=False)
Only 1/100 right answers :( Work a little more and you will succeed!