DD2424 - Assignment 1


In this assignment you will train and test a one-layer network with multiple outputs to classify images from the CIFAR-10 dataset. You will train the network using mini-batch gradient descent applied to a cost function that computes the cross-entropy loss of the classifier applied to the labelled training data and an L2 regularization term on the weight matrix.
Background 1: Mathematical background
The mathematical details of the network are as follows. Given an input vector x of size d x 1, our classifier outputs a vector of probabilities, p (K x 1), for each possible output label:
s = Wx + b    (1)
p = SOFTMAX(s)    (2)
where the matrix W has size K x d, the vector b is K x 1 and SOFTMAX is defined as
SOFTMAX(s) = exp(s) / (1^T exp(s))    (3)
The predicted class corresponds to the label with the highest probability:
k* = arg max_{1≤k≤K} {p_1, . . . , p_K}    (4)
Figure 1: Computational graph of the classification function (a) and the loss function (b) that is applied to each input x in this assignment.
The parameters W and b of our classifier are what we have to learn by exploiting labelled training data. Let D = {(x_i, y_i)}_{i=1}^n, with each y_i ∈ {1, . . . , K} and x_i ∈ R^d, represent our labelled training data. In the lectures we have described how to set the parameters by minimizing the cross-entropy loss plus a regularization term on W. Mathematically this cost function is
J(D, λ, W, b) = (1/|D|) Σ_{(x,y)∈D} l_cross(x, y, W, b) + λ Σ_{i,j} W_{ij}^2    (5)
 
where
l_cross(x, y, W, b) = −log(p_y)    (6)
and p has been calculated using equations (1, 2). (Note: if the label is encoded by a one-hot representation then the cross-entropy loss is defined as −log(y^T p).) The optimization problem we have to solve is
W*, b* = arg min_{W,b} J(D, λ, W, b)    (7)
In this assignment (as described in the lectures) we will solve this optimization problem via mini-batch gradient descent.
For mini-batch gradient descent we begin with a sensible random initialization of the parameters W, b and we then update our estimate for the parameters with
W(t+1) = W(t) − η ∂J(B(t+1), λ, W, b)/∂W |_{W=W(t), b=b(t)}    (8)

b(t+1) = b(t) − η ∂J(B(t+1), λ, W, b)/∂b |_{W=W(t), b=b(t)}    (9)
where η is the learning rate and B(t+1) is called a mini-batch and is a random subset of the training data D and
∂J(B(t+1), λ, W, b)/∂W = (1/|B(t+1)|) Σ_{(x,y)∈B(t+1)} ∂l_cross(x, y, W, b)/∂W + 2λW    (10)

∂J(B(t+1), λ, W, b)/∂b = (1/|B(t+1)|) Σ_{(x,y)∈B(t+1)} ∂l_cross(x, y, W, b)/∂b    (11)

To compute the relevant gradients for the mini-batch, we then have to compute the gradient of the loss w.r.t. each training example in the mini-batch. You should refer to the lecture notes for the explicit description of how to compute these gradients.
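For convenience, the relevant per-example result (the standard softmax plus cross-entropy derivation covered in the lectures, stated here with y as the one-hot label vector) is

∂l_cross/∂s = p − y,    ∂l_cross/∂W = (p − y) x^T,    ∂l_cross/∂b = p − y

so the mini-batch gradients in equations (10, 11) are the averages of these per-example terms over the mini-batch, with 2λW added for the weight matrix.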
Before Starting
I assume that you will complete the assignment in Matlab. You can complete the assignment in another programming language. If you do, though, I will not answer programming-specific questions and you will also probably have to find your own way to display, plot and graph your results.
Besides invoking Matlab commands, you will be required to run a few operating system commands. For these commands I will assume your computer's operating system is either Linux or Unix. If otherwise, you'll have to fend for yourself. But all the non-Matlab commands needed are more-or-less trivial.

Figure 2: Computational graph of the cost function applied to a mini-batch containing one training example x. If you have a mini-batch of size greater than one, then the loss computations are repeated for each entry in the mini-batch (as in the above graph), but the regularization term is only computed once.
The notes for this assignment, and those to follow, will give you pointers about which Matlab commands to use. However, I will not give detailed explanations about their usage. I assume you have some previous experience with Matlab, are aware of many of the in-built functions, and know how to manipulate vectors and matrices and write your own functions, etc. Keep in mind that the function help can be called to obtain information about particular functions. So for example
>> help plot
will display information about the Matlab command plot.
Background 2: Getting Started
Set up your environment
Create a new directory to hold all the Matlab files you will write for this course:
$ mkdir DirName
$ cd DirName
$ mkdir Datasets
$ mkdir Result_Pics
Download the CIFAR-10 dataset stored in its Matlab format from this link. Move the cifar-10-matlab.tar.gz file to the Datasets directory you have just created, untar the file and then move up to the parent directory. Also download the file montage.m from the KTH Social website and move it to DirName.
 
$ mv montage.m DirName/
$ mv cifar-10-matlab.tar.gz DirName/Datasets
$ cd DirName/Datasets
$ tar xvfz cifar-10-matlab.tar.gz
$ cd ..
Background 3: Useful Display Function
You have copied a function called montage.m, which is a slightly modified version of the function available at:
http://www.mathworks.com/matlabcentral/fileexchange/22387
This is a useful function as it allows you to efficiently view the images in a directory or in a Matlab array or a cell array. To look at some of the images from the CIFAR-10 dataset you can use the following set of commands:
>> addpath DirName/Datasets/cifar-10-batches-mat/;
>> A = load('data_batch_1.mat');
>> I = reshape(A.data', 32, 32, 3, 10000);
>> I = permute(I, [2, 1, 3, 4]);
>> montage(I(:, :, :, 1:500), 'Size', [5,5]);
This sequence of commands tells Matlab to add the directory of the CIFAR-10 dataset to its path. Then it loads one of the .mat files containing image and label data. You access the image data with the command A.data. It has size 10000 x 3072. Each row of A.data corresponds to an image of size 32 x 32 x 3 that has been flattened into a row vector. We can re-arrange A.data into the array format expected by montage (32x32x3x10000) using the command reshape. After reshaping the array, the rows and columns of each image still need to be permuted and this is achieved with the permute command. Now you have the array format montage expects. In the above code we just view the first 500 images. Use help to find out the different ways montage can be called.
You have looked at some of the CIFAR-10 images. Now it is time to start writing some code.
Exercise 1: Training a multi-linear classifier
For this assignment you will just use the data in the file data_batch_1.mat for training, the file data_batch_2.mat for validation and the file test_batch.mat for testing. Create a file Assignment1.m. In this file you will write the code for this assignment and the necessary (sub-)functions. Here are my recommendations for which functions to write and the order in which to write them:
1. Write a function that reads in the data from a CIFAR-10 batch file and returns the image and label data in separate variables. Make sure to convert your image data to single or double format and divide it by 255 to convert the data to be between 0 and 1. I would suggest the function has the following input and outputs
function [X, Y, y] = LoadBatch(filename)
where
•    X contains the image pixel data, has size dxN, is of type double or single and has entries between 0 and 1. N is the number of images (10000) and d the dimensionality of each image (3072=32x32x3).
•    Y is KxN (K= # of labels = 10) and contains the one-hot representation of the label for each image.
•    y is a vector of length N containing the label for each image. A note of caution. CIFAR-10 encodes the labels as integers between 0-9 but Matlab indexes matrices and vectors starting at 1. Therefore it may be easier to encode the labels between 1-10.
This file will not be long. You just need to call A = load(fname); and then rearrange and store the values in A.data and A.labels.
Top-level: Read in and store the training, validation and test data.
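To make the expected inputs and outputs concrete, here is a minimal sketch of what LoadBatch and the top-level loading might look like. It is just one possible implementation: the field names A.data and A.labels come from the CIFAR-10 .mat files described above, while the variable names validX, validY, testX, etc. are my own choices.

function [X, Y, y] = LoadBatch(filename)
    A = load(filename);
    X = double(A.data') / 255;          % d x N with entries in [0, 1]
    y = double(A.labels') + 1;          % labels shifted from 0-9 to 1-10
    N = size(X, 2);
    K = 10;
    Y = zeros(K, N);
    Y(sub2ind([K, N], y, 1:N)) = 1;     % K x N one-hot encoding
end

% Top-level: read in and store the training, validation and test data.
[trainX, trainY, trainy] = LoadBatch('data_batch_1.mat');
[validX, validY, validy] = LoadBatch('data_batch_2.mat');
[testX,  testY,  testy]  = LoadBatch('test_batch.mat');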
2.    Top-level: After reading in the data, you can initialize the parameters of the model W and b as you now know what size they should be. W has size Kxd and b is Kx1. Initialize each entry to have Gaussian random values with zero mean and standard deviation .01. You should use the Matlab function randn to create this data.
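As a concrete example of this step (rng(400) is the seed used later in these notes; fixing it is optional but makes runs reproducible):

rng(400);                      % optional: same seed as used later in these notes
K = 10; d = 3072;
W = .01 * randn(K, d);         % K x d Gaussian entries, zero mean, std .01
b = .01 * randn(K, 1);         % K x 1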
3.    Write a function that evaluates the network function, i.e. equations (1, 2), on multiple images and returns the results. I would suggest the function has the following form
function P = EvaluateClassifier(X, W, b)
where
•    each column of X corresponds to an image and it has size dxn.
•    W and b are the parameters of the network.
•    each column of P contains the probability for each label for the image in the corresponding column of X. P has size Kxn.
 
Top-level: Check the function runs on a subset of the training data given a random initialization of the network’s parameters:
P = EvaluateClassifier(trainX(:, 1:100), W, b).
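A minimal sketch of what EvaluateClassifier could look like (the max subtraction is only a numerical-stability trick, not something the assignment asks for):

function P = EvaluateClassifier(X, W, b)
    s = W * X + b;                    % K x n scores, equation (1); relies on implicit
                                      % expansion of b (use repmat on old Matlab versions)
    s = s - max(s, [], 1);            % subtract column-wise max for numerical stability
    P = exp(s) ./ sum(exp(s), 1);     % column-wise SOFTMAX, equations (2)-(3)
end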
4. Write the function that computes the cost function given by equation (5) for a set of images. I suggest the function has the following inputs and outputs
function J = ComputeCost(X, Y, W, b, lambda)
where
•    each column of X corresponds to an image and X has size dxn.
•    each column of Y (Kxn) is the one-hot ground truth label for the corresponding column of X, or Y is the (1xn) vector of ground truth labels.
•    J is a scalar corresponding to the sum of the loss of the network’s predictions for the images in X relative to the ground truth labels and the regularization term on W.
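One way ComputeCost might be written, assuming Y is passed in its one-hot K x n form and that EvaluateClassifier from the previous step is available:

function J = ComputeCost(X, Y, W, b, lambda)
    n = size(X, 2);
    P = EvaluateClassifier(X, W, b);          % K x n probabilities
    loss = -sum(log(sum(Y .* P, 1))) / n;     % average cross-entropy, equation (6)
    J = loss + lambda * sum(W(:).^2);         % plus the L2 term, equation (5)
end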
5. Write a function that computes the accuracy of the network's predictions given by equation (4) on a set of data. Remember the accuracy of a classifier for a given set of examples is the percentage of examples for which it gets the correct answer. I suggest the function has the following inputs and outputs
function acc = ComputeAccuracy(X, y, W, b)
where
•    each column of X corresponds to an image and X has size dxn.
•    y is the vector of ground truth labels of length n.
•    acc is a scalar value containing the accuracy.
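And a corresponding sketch of ComputeAccuracy (returned here as a fraction; multiply by 100 if you prefer a percentage):

function acc = ComputeAccuracy(X, y, W, b)
    P = EvaluateClassifier(X, W, b);      % K x n probabilities
    [~, kstar] = max(P, [], 1);           % predicted label per column, equation (4)
    acc = mean(kstar == y(:)');           % fraction of correctly classified examples
end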
6. Write the function that evaluates, for a mini-batch, the gradients of the cost function w.r.t. W and b, that is equations (10, 11). I suggest the function has the form
function [grad_W, grad_b] = ComputeGradients(X, Y, P, W, lambda)
where
•    each column of X corresponds to an image and it has size dxn.
•    each column of Y (Kxn) is the one-hot ground truth label for the corresponding column of X.
•    each column of P contains the probability for each label for the image in the corresponding column of X. P has size Kxn.
•    grad_W is the gradient matrix of the cost J relative to W and has size Kxd.
•    grad_b is the gradient vector of the cost J relative to b and has size Kx1.
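For orientation only, here is a sketch of ComputeGradients implementing equations (10, 11), using the standard result ∂l_cross/∂s = p − y noted earlier; you must still verify it against the numerical gradients as described next.

function [grad_W, grad_b] = ComputeGradients(X, Y, P, W, lambda)
    n = size(X, 2);
    G = P - Y;                                 % K x n, one column of (p - y) per example
    grad_W = (G * X') / n + 2 * lambda * W;    % K x d, equation (10)
    grad_b = sum(G, 2) / n;                    % K x 1, equation (11)
end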
Everyone makes mistakes when computing gradients. Therefore you must always check your analytic gradient computations against numerical estimations of the gradients! Download code from the Canvas webpage that computes the gradient vectors numerically. Note there are two versions: 1) a slower but more accurate version based on the centered difference formula and 2) a faster but less accurate version based on the finite difference method. You will probably have to make small changes to the functions to make them compatible with your code. It will take some time to run the numerical gradient calculations as your network has 32 × 32 × 3 × 10 different parameters in W. Initially, you should just perform your checks on mini-batches of size 1 and with no regularization (lambda=0). Afterwards you can increase the size of the mini-batch and include regularization in the computations. Another trick is to send in a reduced dimensionality version of trainX and W, so instead of
[ngrad_b, ngrad_W] = ComputeGradsNumSlow(trainX(:, 1), trainY(:, 1), W, b, lambda, 1e-6);
you can compute the gradients numerically for smaller inputs (the first 100 dimensions of the first training example)
[ngrad_b, ngrad_W] = ComputeGradsNumSlow(trainX(1:100, 1), trainY(:, 1), W(1:100, :), b, lambda, 1e-6);
You should then also compute your analytical gradients on this reduced-dimensionality version of the input data. This will speed up computations and also reduce the risk of numerical precision issues (very possible when the full W is initialized with very small numbers and trainX also contains small numbers).
You could compare the numerically and analytically computed gradient vectors (matrices) by examining their absolute differences and declaring that, if all these absolute differences are small (<1e-6), then they have produced the same result. However, when the gradient vectors have small values this approach may fail. A more reliable method is to compute the relative error between a numerically computed gradient value gn and an analytically computed gradient value ga,

|ga − gn| / max(eps, |ga| + |gn|)

where eps is a very small positive number,
and check this is small. There are potentially more issues that can plague numerical gradient checking (especially when you start to train deep rectifier networks), so I suggest you read the relevant section of the Additional material for lecture 3 from Stanford's course Convolutional Neural Networks for Visual Recognition for a more thorough exposition, especially for the subsequent assignments.
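A small sketch of such a check (the variable names and the 1e-6 threshold are just illustrative):

% Compare analytic and numerical gradients via the relative error above
% (Matlab's built-in eps plays the role of the very small positive number).
rel_err = @(ga, gn) abs(ga - gn) ./ max(eps, abs(ga) + abs(gn));
fprintf('max relative error W: %.2e\n', max(max(rel_err(grad_W, ngrad_W))));
fprintf('max relative error b: %.2e\n', max(rel_err(grad_b, ngrad_b)));
% Treat values below roughly 1e-6 as agreement for this assignment's setup.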
 
Do not continue with the rest of this assignment until you are sure your analytic gradient code is correct. If you are having problems, set the seed of the random number generator with the command rng to ensure at each test W and b have the same values and double/triple check that you have a correct implementation of the gradient equations from the lecture notes.
7. Once you have debugged the gradient computations you are ready to write the code to perform the mini-batch gradient descent algorithm to learn the network's parameters, where the updates are defined in equations (8, 9). You have a couple of parameters controlling the learning algorithm (for this assignment you will just implement the most vanilla version of the mini-batch gradient descent algorithm, with no adaptive tuning of the learning rate or momentum terms):
•    n_batch the size of the mini-batches
•    eta the learning rate
•    n_epochs the number of runs through the whole training set.
As the images in the CIFAR-10 dataset are in random order, the easiest way to generate each mini-batch is to just run through the images sequentially. Let n_batch be the number of images in a mini-batch. Then for one epoch (a complete run through all the training images), you can generate the set of mini-batches with this snippet of code:
for j=1:N/n_batch
    j_start = (j-1)*n_batch + 1;
    j_end = j*n_batch;
    inds = j_start:j_end;
    Xbatch = Xtrain(:, j_start:j_end);
    Ybatch = Ytrain(:, j_start:j_end);
end
I suggest the mini-batch learning function has these inputs and outputs
function [Wstar, bstar] = MiniBatchGD(X, Y, GDparams, W, b, lambda)
where X contains all the training images, Y the labels for the training images, W, b are the initial values for the network’s parameters, lambda is the regularization factor in the cost function and
•    GDparams is an object containing the parameter values n_batch, eta and n_epochs
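Putting the pieces together, one possible sketch of MiniBatchGD (assuming GDparams is a struct with fields n_batch, eta and n_epochs, and that the functions from the previous steps exist) is:

function [Wstar, bstar] = MiniBatchGD(X, Y, GDparams, W, b, lambda)
    N = size(X, 2);
    for epoch = 1:GDparams.n_epochs
        for j = 1:N/GDparams.n_batch
            j_start = (j-1)*GDparams.n_batch + 1;
            j_end   = j*GDparams.n_batch;
            Xbatch = X(:, j_start:j_end);
            Ybatch = Y(:, j_start:j_end);
            P = EvaluateClassifier(Xbatch, W, b);
            [grad_W, grad_b] = ComputeGradients(Xbatch, Ybatch, P, W, lambda);
            W = W - GDparams.eta * grad_W;     % update step, equation (8)
            b = b - GDparams.eta * grad_b;     % update step, equation (9)
        end
        % Compute and store the training/validation cost here with ComputeCost
        % if you want the per-epoch curves for the report.
    end
    Wstar = W;
    bstar = b;
end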
For my initial experiments I set n_batch=100, eta=.01, n_epochs=20 and lambda=0. To help you debug I suggest that after each epoch you compute the cost function on all the training data and print it out (and save it). For these parameter settings you should see that the training cost decreases for each epoch. After the first epoch my cost score on all the training data was 2.0276 where I had set the random number generator seed to rng(400) and I had initialized the weight matrix before the bias vector. In figure 3 you can see the training cost score when I run these parameter settings for 40 epochs. The cost score on the validation set is plotted in red in the same figure.
(Note: in Tensorflow and other software packages they count the number of update steps as opposed to the number of training epochs. Given N training images and mini-batches of size n_batch, one epoch will result in N/n_batch update steps of the parameter values. Obviously, the performance of the resulting network depends on the number of update steps and how much training data is used. Therefore, to make fair comparisons one should make sure the number of update steps is consistent across runs - for instance if you change the size of the mini-batch, number of training images, etc., you may need to run more or fewer epochs of training.)
Figure 3: The graph of the training and validation loss computed after every epoch. The network was trained with the following parameter settings: n_batch=100, eta=.01, n_epochs=40 and lambda=0.
When you have finished training you can compute the accuracy of your learnt classifier on the test data. My network achieves (after 40 epochs) an accuracy of 36.69%. (Much better than random but not great. Hopefully, we will achieve improved accuracies in the assignments to come when we build and train more complex networks.)
After training you can also visualize the weight matrix W as an image and see what class templates your network has learnt. Figure 4 shows the templates my network learnt. Here is a code snippet to re-arrange each row of W (assuming W is a 10×d matrix) into a set of images that can be displayed by Matlab.
for i=1:10
    im = reshape(W(i, :), 32, 32, 3);
    s_im{i} = (im - min(im(:))) / (max(im(:)) - min(im(:)));
    s_im{i} = permute(s_im{i}, [2, 1, 3]);
end
Then you can use either imshow, imagesc or montage to display the images in the cell array s_im.
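For example, since the modified montage.m described earlier accepts a cell array, something like the following should display all ten templates (the 'Size' layout is just one choice):

montage(s_im, 'Size', [2, 5]);    % show the 10 class templates in a 2 x 5 grid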
 
Figure 4: The learnt W matrix visualized as class template images. The network was trained with the following parameter settings: n_batch=100, eta=.01, n_epochs=40 and lambda=0.
To complete the assignment:
In the DD2424 Canvas Assignment 1 page upload:
•    The final (and cleaned up) version of the code you have written for the assignment. If you have written the code in multiple files please place them in one file for the uploaded version. This will make life easier for me and the TAs!
•    A pdf document containing a short report. Here you should state whether you successfully managed to write the functions to correctly compute the gradient analytically, what tests you ran to check against the numerically computed gradient and the results of these tests. You should include the following plots/figures
1.    Graphs of the total loss and the cost function on the training data and the validation data after each epoch of the mini-batch gradient descent algorithm.
2.    Images representing the learnt weight matrix after the completion of training.
for the following parameter settings
– lambda=0, n_epochs=40, n_batch=100, eta=.1
– lambda=0, n_epochs=40, n_batch=100, eta=.01
– lambda=.1, n_epochs=40, n_batch=100, eta=.01
– lambda=1, n_epochs=40, n_batch=100, eta=.01
You should also report the final test accuracy your network achieves after each of these training runs, and make a short comment on the effect of increasing the amount of regularization and the importance of the correct learning rate.
Exercise 2: Optional for Bonus Points
 
1. Optimize the performance of the network. It would be interesting to discover the best possible performance achievable by Assignment 1's network on CIFAR-10. Here are some tricks/avenues you can explore to help bump up performance (most of the same tricks you can use for more complicated networks and they would probably have more of an effect for those networks):
(a)    Use all the available training data for training (all five batches minus a small subset of the training images for a validation set). Decrease the size of the validation set down to 1000.
(b)    Train for a longer time and use your validation set to make sure you don’t overfit or to keep a record of the best model before you begin to overfit.
(c)    Do a grid search to find good values for the amount of regularization and the learning rate.
(d)    Play around with decaying the learning rate by a factor .9 after each epoch. Or you can decay the learning rate by a factor of 10 after n epochs.
(e)    Implement Xavier initialization and comment if it stabilizes training.
(f)    Augment your training data by applying small random geometric and photometric jitter to the original training data. You can do this on the fly by applying a random jitter to each image in the mini-batch before doing the forward and backward pass.
(g)    Train several networks using different initializations of the network's parameters. Then apply the ensemble to a test image and use the majority vote as the prediction.
Bonus Points Available: 1 point (if you complete at least 3 improvements - you can follow my suggestions, think of your own or some combination.)
2. Train the network by minimizing the SVM multi-class loss instead of the cross-entropy loss. Of course, you will have to write a new function to compute the gradients.
Bonus Points Available: 2 points
To get the bonus point(s) you must upload the following to the Canvas page Assignment 1 Bonus Points:
1.    Your code.
2.    A Pdf document which
- reports on your trained network with the best test accuracy, what improvements you made and which ones brought the largest gains. (Exercise 2.1)
- compares the test accuracy of the network trained with the SVM loss to that of the network trained with the cross-entropy loss (for several sensible training parameter settings). (Exercise 2.2)
