CSE151B - Deep Learning

Programming Assignment 




1.    There are two components to this assignment: written homework (Problems 1a-c) and a programming part. You will be writing a report in a conference paper format for the programming part of this assignment, reporting your findings. The report should be written using LaTeX or Word in NeurIPS format (NeurIPS is the top machine learning conference, and it is now dominated by deep nets - it will be good practice for you to write in that format!). The templates, both in Word and LaTeX, are available from the 2015 NeurIPS format site.

2.    For the programming part, please work in pairs or triples. I would very much prefer that you not request different team sizes!! If you think you have a good argument for such, please discuss your circumstances with your TA, who will then present your case to me. Again, don’t forget to include a paragraph for each team member in your report that describes what each team member contributed to the project.

3.    You need to submit all of the source code files and a readme.txt file that includes detailed instructions on how to run your code.

You should write clean code with consistent format, as well as explanatory comments. Do not submit any of your output plot files or .pyc files, just the .py files and a readme.txt that explains how to run your code.

4.    Using PyTorch, or any off-the-shelf code is strictly prohibited.

5.    Any form of copying, plagiarizing, grabbing code from the web, having someone else write your code for you, etc., is cheating. We expect you all to do your own work, and when you are on a team, to pull your weight. Team members who do not contribute will not receive the same scores as those who do. Discussions of course materials and homework solutions are encouraged, but you should write the final solutions to the written part alone. Books, notes, and Internet resources can be consulted, but not copied from. Working together on homework must follow the spirit of the Gilligan’s Island Rule (Dymond, 1986): No notes can be made (or recording of any kind) during a discussion, and you must watch one hour of Gilligan’s Island or something equally insipid before writing anything down. Suspected cheating has been and will be reported to the UCSD Academic Integrity office.

Multi-layer Neural Networks
In this assignment, we will be classifying digits from the SVHN dataset. SVHN is a real-world image dataset obtained from house numbers in Google Street View images. We will use Format-2 for the dataset where we have 32x32 cropped images of the digits and the task is to classify those digits. The original dataset is in RGB. We have modified the dataset by converting the images to grayscale, and stored them in NumPy arrays for you. You are expected to work on this modified dataset.

In Assignment 1, we classified the Fashion-MNIST dataset using a single-layer neural network with different output activation functions (logistic and softmax regression). In this assignment, we are going to classify the SVHN dataset using multi-layer neural networks with softmax outputs.

Part I

Homework problems to be solved individually, and turned in individually
For this part we will not be accepting handwritten reports. Please use LaTeX or Word for your report. This should be done individually, and each team member should turn in his or her own work separately. MathType is a handy tool for equations in Word, and the free version (MathType Lite) has everything you need - unfortunately, it is not compatible with recent updates by Microsoft, so you may have to use the built-in equation editor in Word, which kind of sucks. This might be a good time to learn LaTeX!

1. (10pts) For multiclass classification on the SVHN dataset, we will use the cross-entropy error function and softmax as the output layer. In our network, we will have a hidden layer between the input and output that consists of J units with the tanh activation function. So this network has three layers: an input layer, a hidden layer, and a softmax output layer.

Notation: We use index $k$ to represent a node in the output layer, index $j$ to represent a node in the hidden layer, and index $i$ to represent a node in the input layer. The weight from node $i$ in the input layer to node $j$ in the hidden layer is $w_{ij}$. Similarly, the weight from node $j$ in the hidden layer to node $k$ in the output layer is $w_{jk}$.

(a)     (Bonus) Derivation. (4pts extra) In the following discussion, $n$ denotes the $n$th input pattern (see “Notation and Nomenclature” under General Resources). Recall the definition $\delta_i^n = -\frac{\partial E^n}{\partial a_i^n}$, where $a_i^n$ is the weighted sum of the inputs to unit $i$ for pattern $n$. Show that if our output activation function is softmax and the hidden layer activation function is tanh, then for the output layer, $\delta_k^n = t_k^n - y_k^n$, and for the hidden layer, $\delta_j^n = \left(1 - \tanh^2(a_j^n)\right)\sum_k (t_k^n - y_k^n)\,w_{jk}$.

Hint: There are two “hard parts” to this: 1) taking the derivative of the softmax; and 2) figuring out how to apply the chain rule to get the hidden deltas. Bishop and Chapter 8 of the PDP books both have good hints on the latter, and Bishop on the former. However, crucial steps have been left out of the Bishop derivation (Chapter 6). Our main hint here is: break it up into two parts (see equation 6.161 in Bishop), when $k = k'$ and when it doesn't. Note that Bishop (Equation 4.31) defines $\delta$ without a minus sign, which is different from how we defined it above, and different from the PDP book chapter 8.
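For concreteness, the two parts the hint refers to are the two cases of the softmax derivative. This is a standard identity, written here in the notation above ($\mathbb{1}[\cdot]$ denotes the indicator function); filling in the intermediate steps is the point of the bonus problem:

\[
\frac{\partial y_k^n}{\partial a_{k'}^n} =
\begin{cases}
y_k^n\,(1 - y_k^n) & \text{if } k = k', \\
-\,y_k^n\, y_{k'}^n & \text{if } k \neq k',
\end{cases}
\qquad\text{i.e.,}\qquad
\frac{\partial y_k^n}{\partial a_{k'}^n} = y_k^n\left(\mathbb{1}[k = k'] - y_{k'}^n\right).
\]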

(b)     (4pts) Update rule. Write the update rule for w’s in terms of the δ’s (given above) using learning rate α, starting with the gradient descent rule:

$w \leftarrow w - \alpha\,\frac{\partial E}{\partial w}$                                                                                                        (1)

where

$E = -\sum_{n}\sum_{k} t_k^n \ln y_k^n$                                                                                                        (2)

You have to write both update rules, the hidden-to-output layer ($w_{jk}$) update rule and the input-to-hidden ($w_{ij}$) update rule, in a generalized form. (Hint: you will have to use the chain rule for differentiation.) When we say “generalized form,” we mean that here you can leave the output delta, i.e., $-\frac{\partial E^n}{\partial a_k^n}$, simply as $\delta_k^n$; that derivation is the extra credit above. Recall that you start with this:

$\frac{\partial E^n}{\partial w_{ij}} = \frac{\partial E^n}{\partial a_j^n}\,\frac{\partial a_j^n}{\partial w_{ij}}$                                                                                                        (3)

(c)     (6pts) Vectorize computation. The computation is much faster when you update all the $w_{ij}$s and $w_{jk}$s at the same time, using matrix multiplications rather than for loops. Please show the update rule for the weight matrix from the hidden layer to the output layer, and for the weight matrix from the input layer to the hidden layer, using matrix/vector notation.

Part II

Team Programming Assignment
2. Classification on the SVHN dataset. Refer to your derivations from Problem 1.

(a)     (0pts) Read in the SVHN data using the “load_data” function provided in the code. This loads X_train, Y_train, X_test, and Y_test. Split the training data using an 80-20 ratio to generate validation data. This time, to save time, we will only do one-fold cross-validation.

We are not using PCA here, but you will need to normalize the data by z-scoring it. That is, compute the average image over every image in X_train, as well as the standard deviation of each of the 32x32 pixels, and normalize by subtracting the mean and dividing by the standard deviation. Now every pixel will roughly run between -1 and 1 or so, and will have mean 0. We will describe why this is a good idea in class, but if you want to look ahead, read the lecture on tricks of the trade, and/or the reading “LeCun-et-al-98-Tricks-of-the-Trade-1998.pdf.” Now, using the same mean and standard deviation, z-score the validation and test sets as well.
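A minimal NumPy sketch of this z-scoring step (function and variable names are illustrative, and shapes assume flattened N x 1024 image arrays):

import numpy as np

def z_score(X_train, X_val, X_test, eps=1e-8):
    """Normalize every pixel using the training-set statistics only."""
    mu = X_train.mean(axis=0)           # per-pixel mean over the training images
    sigma = X_train.std(axis=0) + eps   # per-pixel std; eps guards against divide-by-zero
    return (X_train - mu) / sigma, (X_val - mu) / sigma, (X_test - mu) / sigma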

(b)     (5pts) Implement backpropagation. Check your code for computing the gradient using a small subset of data. You can compute the slope with respect to one weight using the numerical approximation:

$\frac{\partial E^n}{\partial w} \approx \frac{E^n(w+\epsilon) - E^n(w-\epsilon)}{2\epsilon}$

where $\epsilon$ is a small constant, e.g., $10^{-2}$, and $E^n$ is the cross-entropy error for one pattern. Do the following for several patterns: compare the gradient computed using this numerical approximation with the one computed by backpropagation. The difference of the gradients should be within big-O of $\epsilon^2$, so if you used $10^{-2}$, your gradients should agree to within about $10^{-4}$. (See section 4.8.4 in Bishop for more details.) Note that $w$ here is one weight in the network. You can only check one weight at a time this way - every other weight must stay the same!

Choose one output bias weight, one hidden bias weight, two hidden-to-output weights, and two input-to-hidden weights, and check that the gradient obtained for each of these weights by backpropagation is within $O(\epsilon^2)$ of the gradient obtained by numerical approximation.

For each selected weight $w$, first increment the weight by the small value $\epsilon$, do a forward pass for one training example, and compute the loss. This value is $E^n(w+\epsilon)$. Then reduce $w$ by the same amount $\epsilon$, do a forward pass for the same training example, and compute the loss $E^n(w-\epsilon)$. Then compute the gradient using the equation above and compare it with the gradient obtained by backpropagation. Report the results in a table.
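One way to organize this check is sketched below. The loss_fn argument is a hypothetical zero-argument callable that runs a forward pass on a single training pattern with the network's current weights and returns $E^n$; everything else here is illustrative:

def check_gradient(loss_fn, weights, idx, backprop_grad, eps=1e-2):
    """Compare the backprop gradient for one weight against the central difference."""
    original = weights[idx]
    weights[idx] = original + eps
    loss_plus = loss_fn()               # E^n(w + eps)
    weights[idx] = original - eps
    loss_minus = loss_fn()              # E^n(w - eps)
    weights[idx] = original             # restore the weight before checking the next one
    numerical = (loss_plus - loss_minus) / (2 * eps)
    return numerical, abs(numerical - backprop_grad)   # difference should be O(eps^2)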

(c) (10pts) Using the vectorized update rule you obtained from 1(c), perform gradient descent to learn a classifier that maps each input to one of the labels $t \in \{0,\ldots,9\}$, using a one-hot encoding. Use 128 hidden units. For this programming assignment, use mini-batch stochastic gradient descent throughout, in all problems.
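A minimal sketch of one-hot encoding the integer labels (names are illustrative):

import numpy as np

def one_hot(labels, num_classes=10):
    """Convert integer labels of shape (N,) into one-hot targets of shape (N, num_classes)."""
    targets = np.zeros((labels.shape[0], num_classes))
    targets[np.arange(labels.shape[0]), labels.astype(int)] = 1.0
    return targets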

You should use momentum in your update rule, i.e., include a momentum term weighted by γ, and set γ to 0.9. You should use cross-validation for early stopping of your training: stop training when the error on the validation set goes up. Use the following criterion - if the validation error goes up for some “patience” number of epochs, stop training and save the weights which resulted in the minimum validation error. The patience parameter could be 5, for example.
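The momentum update and the early-stopping bookkeeping can be organized as sketched below. This uses one common momentum convention (the velocity accumulates the gradient step); the callables train_one_epoch and val_loss_fn are hypothetical stand-ins for your own training and evaluation code:

import copy

def momentum_step(w, d_w, velocity, lr, gamma=0.9):
    """One SGD-with-momentum update for a single weight array; returns (new_w, new_velocity)."""
    velocity = gamma * velocity - lr * d_w    # d_w is dE/dw averaged over the mini-batch
    return w + velocity, velocity

def train_with_early_stopping(model, train_one_epoch, val_loss_fn, max_epochs=100, patience=5):
    """Stop when the validation loss has not improved for `patience` epochs; keep the best weights."""
    best_loss, best_model, bad_epochs = float("inf"), copy.deepcopy(model), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                # one pass of mini-batch SGD over the training set
        val_loss = val_loss_fn(model)
        if val_loss < best_loss:
            best_loss, best_model, bad_epochs = val_loss, copy.deepcopy(model), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_model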

Describe your training procedure. Plot your training and validation accuracy (i.e., percent correct) vs. number of training epochs, as well as training and validation loss vs. number of training epochs. Report accuracy on test set using the best weights obtained through early stopping.

You may experiment with different learning rates, but you only need to report your results and plots on the best learning rate you find.

Your loss value should start around 2 (can you figure out why?) and the expected accuracy on the test dataset is close to 80% with the default config provided to you. It is acceptable for it to be a few points off but if there is a significant difference (for instance, < 70%), then there is probably a bug in your code, or you have too high or too low a learning rate. The loss plots should look like your standard training plots. You may or may not observe overfitting depending on your choice of hyperparameters.

(d)     (5pts) Experiment with Regularization. Starting with the network you used for part c, with new initial random weights, add weight decay to the update rule. (You will have to decide the amount of regularization, i.e., λ, the factor multiplying the weight decay penalty. Experiment with L2 regularization using values of 1e-3 and 1e-6 for λ.) Again, plot training and validation loss and training and validation accuracy, and report the final test accuracy. For this problem, train about 10% more epochs than you found in part c (i.e., if you found that 100 epochs were best, train for 110 for this problem). Comment on the change of performance, if any.

Try using L1 regularization instead of L2 regularization. Explain any difference in performance, if you see any. In particular, do you see a decrease or increase in performance? Why could that be?
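For reference, a sketch of how the penalty's gradient might enter your weight update, assuming the penalty added to the loss is $\frac{\lambda}{2}\lVert w\rVert_2^2$ for L2 and $\lambda\lVert w\rVert_1$ for L1 (whether you also regularize biases is a convention you should state in your report):

import numpy as np

def regularized_gradient(d_w, w, lam, kind="l2"):
    """Add the weight-decay gradient to the data gradient d_w (= dE/dw)."""
    if kind == "l2":
        return d_w + lam * w              # gradient of (lam/2) * ||w||^2
    if kind == "l1":
        return d_w + lam * np.sign(w)     # (sub)gradient of lam * ||w||_1
    raise ValueError("unknown regularizer: " + kind)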

(e)     (5pts) Experiment with Activations. Starting with the network of part c, try using different activation functions for the hidden units. You are already using tanh; try the other two below. Note that the derivative changes when the activation rule changes!!

i.      Sigmoid. $f(z) = \frac{1}{1 + e^{-z}}$

ii.     ReLU. $f(z) = \max(0, z)$

The weight update rule is exactly the same for each activation function. The only thing that changes is the derivative of the activation function when computing the hidden unit δs. For each activation function you try, plot training and validation loss on one graph, training and validation accuracy on another, and report final test accuracy. Comment on the change of performance.
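For reference, a sketch of the three hidden activations and their derivatives, written in terms of the net input z as they might appear in the Activation class (illustrative, not the starter code):

import numpy as np

def tanh(z):
    return np.tanh(z)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2          # 1 - tanh^2(z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)                  # sigma(z) * (1 - sigma(z))

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)          # 1 where z > 0, else 0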

(f)      (5pts) Experiment with Network Topology. Starting with the network from part c, consider how the topology of the neural network changes the performance.

i.      Try halving and doubling the number of hidden units. Plot training and validation loss, training and validation accuracy, and report final test accuracy. How does performance change? Explain your results.

ii.    Change the number of hidden layers. Use two hidden layers instead of one. Create a new architecture that uses two hidden layers of equal size and has approximately the same number of parameters for the best model choice in the previous experiment. By that, we mean it should have roughly the same total number of weights and biases. Report the final test accuracy. Comment on the change of performance. [N.B. Here it is important to check your gradient again, in case you made a mistake when using two hidden layers.]
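To size the two equal hidden layers, it helps to count parameters explicitly. A small sketch, using 1024 inputs (the flattened 32x32 grayscale images), 10 outputs, and 128 hidden units as the single-hidden-layer reference (substitute whichever width won in the previous experiment):

def count_params(layer_sizes):
    """Total number of weights and biases for a fully-connected net, e.g. [1024, 128, 10]."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

target = count_params([1024, 128, 10])    # parameters in the single-hidden-layer model
# Find the two-hidden-layer width whose parameter count is closest to the target.
best_width = min(range(1, 1025), key=lambda h: abs(count_params([1024, h, h, 10]) - target))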

Instructions for Programming Assignment

The SVHN dataset and starter code have been provided to you on Piazza.

You need to complete the neuralnet.py, main.py, train.py and utils.py files to complete the assignment. These files are skeleton code designed to guide you to build and implement your neural net in an efficient and modular fashion, and this will give you a feel for what developing models in PyTorch will be like.

Follow instructions in the code on how to install PyYAML. The config.yaml specifies the configuration for your Neural Network architecture, training hyperparameters, type of activation, etc. The purpose of each flag is indicated by the comment before it. Play around with the parameters here to decide what works best for the problem.

The class Activation includes the definitions for all activation functions and their gradients, which you need to fill in. The forward and backward method signatures have been given for you in this class; you will have to program their implementations. The code is structured in such a way that each activation function is treated as an additional layer on top of a linear layer that computes the net input (a) to the unit. To add an activation layer after a fully-connected or linear layer, a new object of this class needs to be instantiated and added to the model.

The Layer class denotes a standard fully-connected / linear layer. The forward and backward functions need to be implemented by you. As the name suggests, forward takes in an input vector ‘x’ and outputs the variable ‘a’. Do not apply the activation function to the computed weighted sum of inputs, since the activation function is implemented as a separate layer, as mentioned above. The function backward takes the weighted sum of the deltas from the layer above it as input, and computes the gradient for its weights (to be saved in ‘d_w’) and biases (to be saved in ‘d_b’). If there is another layer below it (multiple hidden layers), it also passes the weighted sum of the deltas back to the previous layer. Otherwise, if the previous layer is the input layer, it stops there.
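A sketch of what such a layer might look like; the attribute names follow the description above (d_w, d_b), but treat this as illustrative rather than the actual starter code, and keep your delta sign convention consistent with your derivation in Part I:

import numpy as np

class Layer:
    """Fully-connected layer computing a = x @ w + b (no activation applied here)."""
    def __init__(self, in_units, out_units):
        self.w = 0.01 * np.random.randn(in_units, out_units)
        self.b = np.zeros((1, out_units))
        self.x = None                     # input cached during forward, used in backward
        self.d_w = None
        self.d_b = None

    def forward(self, x):
        self.x = x
        return x @ self.w + self.b        # the net input 'a'

    def backward(self, delta):
        """delta: error signal w.r.t. this layer's output, shape (batch, out_units)."""
        self.d_w = self.x.T @ delta                    # gradient for the weights
        self.d_b = delta.sum(axis=0, keepdims=True)    # gradient for the biases
        return delta @ self.w.T                        # weighted deltas passed to the layer below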

The NeuralNetwork class defines the entire network. The ‘__init__’ function has been implemented for you; it uses the configuration specified in ‘config.yaml’ to generate the network. Make sure you understand this function very carefully, since a good understanding of it will be needed while implementing forward and backward for this class. The function ‘forward’ takes in the input dataset ‘x’ and the targets (in one-hot encoded form) as input, performs a forward pass on the data ‘x’, and returns the loss and predictions. The ‘backward’ function computes the error signal from the saved predictions and targets and performs a backward pass through all the layers by calling the backward pass of each layer of the network, until it reaches the first hidden layer above the input layer (usually there will only be one hidden layer for this project, but when there are more, there will be more backward passes). The ‘loss’ function computes the cross-entropy loss from the predictions y and the targets and returns this loss.
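For the forward pass, a numerically stable softmax and the matching cross-entropy loss might look like the following sketch (assuming one-hot target rows; this is illustrative, not the starter code):

import numpy as np

def softmax(a):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = a - a.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y, targets, eps=1e-12):
    """Mean cross-entropy over the batch, for one-hot targets."""
    return -np.sum(targets * np.log(y + eps)) / y.shape[0]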

The load_data method has been implemented for you. You need to implement the rest of the required functions, such as the train and test functions. The requirements for these functions and all other functions are given in the code.

