Exercise 1: Convolutional Neural Networks
Let x = (x^(i))_i be a multivariate time series represented as a collection of one-dimensional signals. For simplicity of the following derivations, we consider these signals to be of infinite length. We feed x as input to a convolutional layer. The forward computation in this layer is given by:

z_t^(j) = Σ_i Σ_τ w_τ^(ij) x_(t−τ)^(i)

for all j and t ∈ Z.
It results in z = (z^(j))_j, another multivariate time series, again composed of a collection of output signals also assumed to be of infinite length. The convolution filters w^(ij) have to be learned from the data. After passing the data through the convolutional layer, the neural network output is given as y = f(z), where f is some top-layer function assumed to be differentiable. To learn the model, parameter gradients need to be computed.
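To make the forward computation concrete, here is a minimal PyTorch sketch (not part of the exercise); the shapes I, J, T, K are illustrative assumptions, and note that torch.nn.functional.conv1d computes a cross-correlation, i.e. the formula above with the filters flipped in τ:

import torch
import torch.nn.functional as F

I, J, T, K = 3, 5, 100, 7           # illustrative: I input signals, J output signals, length T, filter length K
x = torch.randn(1, I, T)            # input x: (batch, input signals, time)
w = torch.randn(J, I, K)            # filters w^(ij): (output signals, input signals, filter length)
z = F.conv1d(x, w, padding=K // 2)  # z: (batch, output signals, time), cross-correlation of x with w
print(z.shape)                      # torch.Size([1, 5, 100])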
(a) Express the gradient ∂y/∂w as a function of the input x and of the gradient ∂y/∂z.
Exercise 2: Recursive Neural Networks (10+10 P)
Consider a recursive neural network that applies some function φ recursively from the leaves to the root of a binary parse tree. The function φ : R^d × R^d × R^h → R^d takes two incoming nodes a_i, a_j ∈ R^d and some parameter vector θ ∈ R^h as input, and produces an output a_k ∈ R^d. Once we have reached the root node, we apply a function f : R^d → R that produces some real-valued prediction. We assume that both φ and f are differentiable with respect to their inputs.
Consider the sentence ‘the cat sat on the mat’, that is parsed as:
((the,cat),(sat,(on,(the,mat))))
The leaf nodes of the parse tree are represented by word embeddings a_the, a_cat, ... ∈ R^d.
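As an illustration only (not part of the exercise), the following Python sketch applies a placeholder φ recursively over the nested parse tuple; φ, the embeddings, and the dimensions d and h are hypothetical stand-ins:

import torch

d, h = 4, 3
theta = torch.randn(h)
emb = {w: torch.randn(d) for w in ['the', 'cat', 'sat', 'on', 'mat']}   # word embeddings (random placeholders)

def phi(a_left, a_right, theta):
    # stand-in for the learned composition function phi: R^d x R^d x R^h -> R^d
    return torch.tanh(a_left + a_right + theta.sum())

def encode(node):
    # apply phi recursively from the leaves to the root of the parse tree
    if isinstance(node, str):
        return emb[node]            # leaf node: word embedding
    left, right = node
    return phi(encode(left), encode(right), theta)

tree = (('the', 'cat'), ('sat', ('on', ('the', 'mat'))))
a_root = encode(tree)               # representation of the root node, to be fed into f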
(a) Draw the computational graph associated to the application of the recursive neural network to this sentence, and write down the set of equations, e.g.
a_1 = φ(a_the, a_cat, θ)
a_2 = φ(a_the, a_mat, θ)
a_3 = φ(a_on, a_2, θ)
...
y = f(a_5)
(b) Express the total derivative dy/dθ (taking into account all direct and indirect dependencies on the parameter
θ) in terms of local derivatives (relating adjacent quantities in the computational graph).
(Hint: you can use for this the chain rule for derivatives, which states that for some function h(g_1(t), ..., g_N(t)) we have:

dh/dt = Σ_(n=1)^N (∂h/∂g_n) · (dg_n/dt)

where d(·) and ∂(·) denote the total and partial derivatives respectively.)
Exercise 3: Graph Neural Networks (10+10 P)
Graph neural networks are a fairly flexible class of neural networks that can sometimes be seen as a generalization of convolutional neural networks and recursive neural networks. We would like to study the equivalence of graph neural networks with other neural networks. In this exercise, we consider graph neural networks that map two consecutive layers using the equation:
H_{t+1} = ρ(Λ H_t W)
where ρ denotes the ReLU function. The adjacency matrix Λ is of size N × N with value 1 when there is a connection, and 0 otherwise. The matrix of parameters W is of size d × d, where d is the number of dimensions used to represent each node. We also assume that the initial state H0 is a matrix filled with ones.
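As a small illustration of this propagation rule (the graph, the dimension d and the random W below are assumptions made only for the example), the two propagation steps of a depth-2 graph neural network on a 3-node path graph can be computed as:

import numpy as np

Lambda = np.array([[0., 1., 0.],
                   [1., 0., 1.],
                   [0., 1., 0.]])       # adjacency matrix (1 = connection, 0 otherwise)
d = 2
W = np.random.randn(d, d)               # parameter matrix of size d x d
H = np.ones((3, d))                     # initial state H_0 filled with ones
for _ in range(2):                      # depth-2 graph neural network
    H = np.maximum(0, Lambda @ H @ W)   # H_{t+1} = rho(Lambda H_t W) with rho = ReLU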
(a) Consider first the following undirected graph:
Because the graph is unlabeled, the nodes can only be distinguished by their connectivity to other nodes. Consider the case where the graph neural network has depth 2. Depict the multiple trees formed by viewing the graph neural network as a collection of recursive neural networks.
(b) Consider now the following infinite lattice graph:
which is, like in the previous example, undirected and unlabeled. We again consider the case where the graph neural network has depth 2. Show that the latter can be implemented as a 2D convolutional neural network with four convolution layers and two ReLU layers, i.e. give the sequence of layers and their parameters.
Exercise 4: Programming (50 P)
Download the programming files on ISIS and follow the instructions.
Structured Neural Networks
In this homework, we train a collection of neural networks including a convolutional neural network on the MNIST dataset, and a graph neural network on some graph classification task.
We first consider the convolutional neural network, which we apply in the following to the MNIST data.
In [2]: transform = transforms.Compose(
[transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
Xr,Tr = trainset.data.float().view(-1,1,28,28)/127.5-1, trainset.targets
Xt,Tt = testset.data.float().view(-1,1,28,28)/127.5-1, testset.targets
We consider for this dataset a convolutional network made of four convolution layers and two pooling layers.
The network is wrapped in the class utils.NNClassifier, which exposes scikit-learn-like functions such as fit() and predict(). To evaluate the convolutional neural network, we also consider two simpler baselines: a one-layer linear network, and a standard fully-connected network composed of two layers.
Evaluating the convolutional neural network (15 P)
We now proceed with the comparison of these three classifiers.
Task:
Train each classifier for 5 epochs and print the classification accuracy on the training and test data (i.e. the fraction of the examples that are correctly classified). To avoid running out of memory, compute the training and test accuracy based only on the first 2500 examples of the training and test set respectively.
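One possible way to fill in the TODO of the next cell is sketched below; analyze is a hypothetical helper, and it assumes the fit()/predict() interface of utils.NNClassifier described above (in particular that fit() accepts an epochs argument and that predict() returns a tensor of class scores):

def analyze(cl, Xr, Tr, Xt, Tt, n=2500):
    cl.fit(Xr, Tr, epochs=5)                                    # train for 5 epochs
    Yr, Yt = cl.predict(Xr[:n]), cl.predict(Xt[:n])             # score the first n examples of each split
    acctr = (Yr.max(dim=1)[1] == Tr[:n]).float().mean().item()  # training accuracy
    acctt = (Yt.max(dim=1)[1] == Tt[:n]).float().mean().item()  # test accuracy
    return acctr, acctt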
In [5]: for name,cl in [('linear',lin),('full',fc),('conv',cnn)]:
            # ------------------------------------
            # TODO: Replace by your code
            # ------------------------------------
            import solution
            errtr,errtt = solution.analyze(cl,Xr,Tr,Xt,Tt)
            # ------------------------------------
            print('%10s train: %.3f test: %.3f'%(name,errtr,errtt))
    linear train: 0.910 test: 0.878
      full train: 0.966 test: 0.954
      conv train: 0.990 test: 0.982
We observe that the convolutional neural network reaches the highest accuracy, with less than 2% of misclassified digits on the test data.
Confidently predicted digits (15 P)
We now ask whether some digits are easier to predict than others for the convolutional neural network. For this, we observe that the neural network produces at its output scores yc for each class c. These scores can be converted to a class probability using the softargmax (also called softmax) function:
p_c = exp(y_c) / Σ_c' exp(y_c')
Task:
Find for the convolutional network the data points in the test set that are predicted with the highest and with the lowest probability (the lowest achievable probability for the predicted class corresponding to random guessing). To avoid numerical instability, your implementation should work in the log-probability domain and make use of numerically stable functions of numpy/scipy such as logsumexp.
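A minimal sketch of the log-probability computation is given below; it assumes Y is a numpy array holding the scores y_c for the test examples, e.g. obtained as Y = cnn.predict(Xt[:2500]).data.numpy():

import numpy as np
from scipy.special import logsumexp

logp = Y - logsumexp(Y, axis=1, keepdims=True)   # numerically stable log-softargmax
conf = logp.max(axis=1)                          # log-probability of the predicted class
most_confident  = np.argsort(-conf)[:12]         # indices of the most confidently predicted digits
least_confident = np.argsort(conf)[:12]          # indices of the least confidently predicted digits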
We observe that the most confident digits are thick and prototypical. Interestingly, the highest-confidence digits are all from the class "3". The low-confidence digits are, on the other hand, thinner, and are often also more difficult to predict for a human.
Graph Neural Network (20 P)
We consider a graph neural network (GNN) that takes as input graphs of size m given by their adjacency matrix A and which is composed of the following four layers:
H_0 = U
H_1 = ρ(Λ H_0 W)
H_2 = ρ(Λ H_1 W)
H_3 = ρ(Λ H_2 W)
y = 1⊤ H_3 V
U is a matrix of size m × h, W is a matrix of size h × h, V is a matrix of size h × 3, and Λ is the normalized Laplacian associated to the graph adjacency matrix A (i.e. Λ = D^(−0.5) A D^(−0.5) where D is a diagonal matrix containing the degree of each node), and ρ(t) = max(0, t) is the rectified linear unit that applies element-wise.
Task:
Implement the forward function of the GNN. It should take as input a minibatch of adjacency matrices A (given as a 3-dimensional tensor of dimensions (minibatch_size × number_nodes × number_nodes)) and return a matrix of size minibatch_size × 3 representing the scores for each example and predicted class.
(Note: in your implementation use array operations instead of looping over all individual examples of the minibatch.)
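A possible sketch of such a forward function is given below; the constructor signature GNN(m, h, c), the random parameter initialization and the small constant added to the degrees are assumptions, while the layer structure follows the equations above:

import torch
import torch.nn as nn

class GNN(nn.Module):
    def __init__(self, m, h, c):
        super().__init__()
        self.U = nn.Parameter(0.1 * torch.randn(m, h))   # initial state H_0 = U
        self.W = nn.Parameter(0.1 * torch.randn(h, h))
        self.V = nn.Parameter(0.1 * torch.randn(h, c))

    def forward(self, A):
        # A: (minibatch_size, m, m) batch of adjacency matrices
        d = A.sum(dim=2)                               # node degrees, shape (batch, m)
        dinv = (d + 1e-9).pow(-0.5)                    # diagonal entries of D^(-0.5)
        L = dinv.unsqueeze(2) * A * dinv.unsqueeze(1)  # normalized Laplacian Lambda = D^(-0.5) A D^(-0.5)
        H = self.U.expand(A.shape[0], -1, -1)          # H_0 = U, broadcast over the minibatch
        for _ in range(3):                             # H_1, H_2, H_3
            H = torch.relu(L @ (H @ self.W))           # rho(Lambda H_t W)
        return H.sum(dim=1) @ self.V                   # y = 1^T H_3 V, shape (batch, c)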
The graph neural network is now tested on a simple graph classification task where the three classes correspond to star-shaped, chain-shaped and random-shaped graphs. Because the GNN is more difficult to optimize and the dataset is smaller, we train the network for 500 epochs. We compare the GNN with a simple fully-connected network built directly on the adjacency matrix.
In [8]: Ar,Tr,At,Tt = utils.graphdata()
torch.manual_seed(0)
        dnn = utils.NNClassifier(nn.Sequential(nn.Linear(225,512), nn.ReLU(), nn.Linear(512,3)), flat=True)

        torch.manual_seed(0)
        gnn = utils.NNClassifier(GNN(15,25,3))

        for name,net in [('DNN',dnn),('GNN',gnn)]:
            net.fit(Ar,Tr,lr=0.01,epochs=500)
            Yr = net.predict(Ar)
            Yt = net.predict(At)
            acctr = (Yr.max(dim=1)[1] == Tr).data.numpy().mean()
            acctt = (Yt.max(dim=1)[1] == Tt).data.numpy().mean()
            print('name: %10s train: %.3f test: %.3f'%(name,acctr,acctt))
name:        DNN train: 1.000 test: 0.829
name:        GNN train: 1.000 test: 0.965
We observe that both networks are able to perfectly classify the training data; however, due to its particular structure, the graph neural network generalizes better to new data points.