Project-1 of “Neural Network and Deep Learning”
1 Neural Network
In this problem we will investigate handwritten digit classification. The inputs are 16 by 16 grayscale images of handwritten digits (0 through 9), and the goal is to predict the digit shown in a given image. If you run example_neuralNetwork, it will load this dataset and train a neural network with stochastic gradient descent, printing the validation-set error as it goes. To handle the 10 classes, there are 10 output units (using a {−1,1} encoding of each of the ten labels), and the squared error is used during training. Your task in this question is to modify the training procedure and/or network architecture to optimize performance.
Report the best test error you are able to achieve on this dataset, and describe the modifications you made to achieve it. Please refer to the earlier instructions on writing the report.
1.1 Hint
Below are additional features you could try to incorporate into your neural network to improve performance (the options are listed approximately in order of increasing difficulty). You do not have to implement all of them; the modifications you make to try to improve performance are up to you, and you can even try things that are not on the list in Sec. 1.2. However, please stick with neural network models and use only a single neural network (no ensembles).
1.2 Questions
1. Change the network structure: the vector nHidden specifies the number of hidden units in each layer.
2. Change the training procedure by modifying the sequence of step-sizes or using different step-sizes for different variables. Recall that momentum uses the update
$w^{t+1} = w^t - \alpha_t \nabla f(w^t) + \beta_t (w^t - w^{t-1}),$
where $\alpha_t$ is the learning rate (step size) and $\beta_t$ is the momentum strength. A common choice is a constant $\beta_t = 0.9$. (A sketch of this update is given after this list.)
3. You could vectorize the evaluation of the loss function (e.g., try to express as much as possible in terms of matrix operations) to allow more training iterations in a reasonable amount of time; a vectorized forward/backward pass is sketched after this list.
4. Add ℓ2-regularization (or ℓ1-regularization) of the weights to your loss function. For neural networks this is called weight decay (see the sketch after this list). An alternative form of regularization that is sometimes used is early stopping, which means stopping training when the error on a validation set stops decreasing.
5. Instead of using the squared error, use a softmax (multinomial logistic) layer at the end of the network so that the 10 outputs can be interpreted as probabilities of each class. Recall that the softmax function maps the network outputs $z_1, \dots, z_{10}$ to probabilities
$p(y_i = c \mid x_i) = \frac{\exp(z_c)}{\sum_{c'=1}^{10} \exp(z_{c'})},$
and you can replace the squared error with the negative log-likelihood of the true label under this model, $-\log p(y_i \mid x_i)$. (A sketch is given after this list.)
6. Instead of just having a bias variable at the beginning, make one of the hidden units in each layer a constant, so that each layer has a bias.
7. Implement “dropout”, in which hidden units are dropped out with probability p during training. A common choice is p = 0.5 (see the sketch after this list).
8. You can do ‘fine-tuning’ of the last layer: fix the parameters of all layers except the last one, and solve for the parameters of the last layer exactly as a convex optimization problem. For example, treat the input to the last layer as the features and use techniques from earlier in the course (this is particularly fast if you use the squared error, since it has a closed-form solution); see the least-squares sketch after this list.
9. You can artificially create more training examples by applying small transformations (translations, rotations, resizing, etc.) to the original images (see the sketch after this list).
10. Replace the first layer of the network with a 2D convolutional layer. You will need to reshape the USPS images back to their original 16 by 16 format. The Matlab conv2 function implements 2D convolutions; filters of size 5 by 5 are a common choice (see the sketch below).
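The following is a minimal sketch of stochastic gradient descent with momentum for option 2. The names funObj, maxIter, alpha, and beta are placeholders (not the exact names used in example_neuralNetwork); funObj stands in for whatever function returns the loss and gradient on one training example.

% SGD with momentum: w^{t+1} = w^t - alpha_t*grad + beta_t*(w^t - w^{t-1})
alpha = 1e-3;                               % step size alpha_t (tune this)
beta  = 0.9;                                % momentum strength beta_t
wOld  = w;                                  % previous iterate w^{t-1}
for t = 1:maxIter
    i = ceil(rand*n);                       % pick a random training example
    [f,g] = funObj(w, X(i,:), y(i,:));      % stochastic loss and gradient
    wNew  = w - alpha*g + beta*(w - wOld);  % momentum update
    wOld  = w;
    w     = wNew;
end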
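For option 3, here is a sketch of a vectorized loss and gradient for a single hidden layer with tanh units and squared error, ignoring bias terms. The weight shapes (W1 is d-by-h, W2 is h-by-10) are assumptions made for this sketch and do not match how the provided code stores its parameters.

Z    = tanh(X*W1);                       % n-by-h hidden activations, all examples at once
Yhat = Z*W2;                             % n-by-10 outputs
R    = Yhat - Y;                         % residuals against the {-1,1} targets
f    = sum(R(:).^2);                     % squared error over the whole set
gW2  = 2*(Z'*R);                         % gradient w.r.t. output weights
gW1  = 2*(X'*((R*W2').*(1 - Z.^2)));     % backpropagate through the tanh layer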
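A sketch of ℓ2 weight decay for option 4, assuming w stacks all of the weights into one vector and funObj (again a placeholder name) returns the unregularized loss and gradient:

lambda = 1e-2;                 % regularization strength (a tuning parameter)
[f,g] = funObj(w, X, y);       % unregularized loss and gradient
f = f + lambda*(w'*w);         % add lambda*||w||^2 to the loss
g = g + 2*lambda*w;            % and its gradient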
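For option 5, a sketch of the softmax output and the negative log-likelihood loss, assuming z is the 1-by-10 vector of outputs for one example and y is its true class in 1..10:

z  = z - max(z);                % subtract the max for numerical stability
p  = exp(z) ./ sum(exp(z));     % softmax probabilities over the 10 classes
f  = -log(p(y));                % negative log-likelihood of the true label
dz = p; dz(y) = dz(y) - 1;      % gradient of -log p(y) w.r.t. the outputs z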
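A sketch of dropout (option 7) applied to one layer's hidden activations z during training:

p    = 0.5;                     % drop probability
mask = rand(size(z)) > p;       % keep each hidden unit with probability 1-p
z    = z .* mask;               % zero out the dropped units for this example
% At test time, keep all units but scale the activations by (1-p) instead.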
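For option 8, a sketch of solving for the last layer in closed form, assuming H is the n-by-h matrix of last-hidden-layer activations (with a constant column for the bias) computed with all earlier weights held fixed, and Y is the n-by-10 matrix of {-1,1} targets:

W_out = H \ Y;                  % least-squares solution for the squared error
% If H'*H is ill-conditioned, a small ridge term helps:
% W_out = (H'*H + 1e-3*eye(size(H,2))) \ (H'*Y);
Yhat  = H*W_out;                % predictions with the fine-tuned last layer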
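For option 9, a sketch of generating extra examples from one training row Xi by small transformations (imrotate requires the Image Processing Toolbox; circshift wraps pixels around the border, which is crude but simple):

I      = reshape(Xi, 16, 16);                 % back to the original image shape
Ishift = circshift(I, [1 0]);                 % translate down by one pixel
Irot   = imrotate(I, 5, 'bilinear', 'crop');  % rotate by 5 degrees, keep 16-by-16
Xaug   = [Ishift(:)'; Irot(:)'];              % two new rows with the same label as Xi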
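Finally, for option 10, a sketch of a convolutional first layer using conv2, assuming W is a cell array of nFilters 5-by-5 filters (both names are placeholders):

I = reshape(Xi, 16, 16);                 % one example back in image form
Z = cell(nFilters, 1);
for k = 1:nFilters
    A    = conv2(I, W{k}, 'valid');      % 12-by-12 feature map from a 5-by-5 filter
    Z{k} = tanh(A);                      % nonlinearity applied to the feature map
end
h = cellfun(@(M) M(:), Z, 'UniformOutput', false);
h = vertcat(h{:})';                      % flattened features fed to the next (fully connected) layer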