1. Support Vector Machines (25 points).
(a) Use the Lagrangian provided in the lecture to show that the equivalent dual problem can be written as
\[
\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j\, y_i y_j\, x_i^\top x_j
\quad \text{s.t.}\quad \alpha_i \ge 0,\ \ i = 1,\ldots,N, \qquad \sum_{i=1}^{N}\alpha_i y_i = 0.
\]
(b) Assume that we solved the above dual formulation and obtained the optimal α. For a given test data point x, how can we predict its class?
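For reference, the answer to (b) typically reduces to a one-line formula. A minimal sketch, assuming the standard max-margin dual with labels y_i ∈ {−1, +1} (the notation here may differ from the lecture's):

\[
\hat{y}(x) = \operatorname{sign}\!\Big(\sum_{i=1}^{N} \alpha_i y_i\, x_i^\top x + b\Big),
\qquad
b = y_s - \sum_{i=1}^{N} \alpha_i y_i\, x_i^\top x_s \ \ \text{for any support vector } x_s\ (\alpha_s > 0).
\]

Only the support vectors (those with α_i > 0) contribute to the sum, so prediction is cheap even when N is large.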
2. Neural Networks (60 points). In this problem, you will experiment on a subset of the Toronto Faces Dataset (TFD). You will complete the starter code provided to you, and experiment with the completed code. You should understand the code instead of using it as a black box.
We subsample 3374, 419 and 385 grayscale images from TFD as the training, validation, and test sets, respectively. Each image is of size 48×48 and contains a face that has been extracted from a variety of sources. The faces have been rotated, scaled and aligned to make the task easier. The faces have been labeled by experts and research assistants based on their expression. These expressions fall into one of seven categories: 1-Anger, 2-Disgust, 3-Fear, 4-Happy, 5-Sad, 6-Surprise, 7-Neutral. We show one example face per class in Figure 1.
Figure 1: Example faces. From left to right, the corresponding classes are 1 through 7.
Code for training a neural network (fully connected) is partially provided in nn.py.
2.1. Complete the code [20 points]. Follow the instructions in nn.py to implement the missing functions that perform the backward pass of the network.
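As a rough guide to what these functions look like, here is a minimal NumPy sketch of the backward pass for one fully connected layer with a ReLU nonlinearity. The function and variable names are hypothetical; nn.py's actual interfaces will differ:

import numpy as np

def affine_backward(grad_out, x, w):
    # Backward pass for a fully connected layer y = x @ w + b.
    # grad_out: dL/dy with shape (batch, num_out).
    grad_x = grad_out @ w.T        # dL/dx, shape (batch, num_in)
    grad_w = x.T @ grad_out        # dL/dw, shape (num_in, num_out)
    grad_b = grad_out.sum(axis=0)  # dL/db, shape (num_out,)
    return grad_x, grad_w, grad_b

def relu_backward(grad_out, x):
    # ReLU passes the gradient through only where the forward input was positive.
    return grad_out * (x > 0)

A quick way to sanity-check your implementation is to compare each returned gradient against a finite-difference estimate of the loss.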
2.2. Generalization [10 points]. Train the neural network with the default set of hyperparameters. Report the training and validation errors, and include a plot of the error curves (training and validation). Examine the statistics and plots of training error and validation error (generalization). How does the network's performance differ on the training set vs. the validation set during learning?
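One way to produce the requested plot, as a sketch (matplotlib; train_errors and valid_errors are assumed to be per-epoch lists recorded by the training loop):

import matplotlib.pyplot as plt

def plot_error_curves(train_errors, valid_errors):
    # One curve per dataset, indexed by epoch.
    epochs = range(1, len(train_errors) + 1)
    plt.plot(epochs, train_errors, label='training error')
    plt.plot(epochs, valid_errors, label='validation error')
    plt.xlabel('epoch')
    plt.ylabel('classification error')
    plt.legend()
    plt.show()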
2.3. Optimization [10 points]. Try different values of the learning rate (step size) η ("eta"), ranging over η ∈ {0.001, 0.01, 0.5}. What happens to the convergence properties of the algorithm (looking at both cross-entropy and percent correct)? Try 3 different mini-batch sizes, ranging over {10, 100, 1000}. How does mini-batch size affect convergence? How would you choose the best value of these parameters? In each of these experiments, hold the other parameters constant while you vary the one you are studying.
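A sketch of the one-at-a-time sweeps asked for here; the train function and its keyword arguments are hypothetical stand-ins for whatever entry point nn.py actually exposes, and the same pattern applies to the hidden-unit sweep in 2.4 below:

from nn import train  # hypothetical entry point; adjust to nn.py's actual API

# Baseline hyperparameters; vary exactly one per sweep.
defaults = dict(eta=0.01, batch_size=100, num_epochs=50)

for eta in [0.001, 0.01, 0.5]:
    stats = train(**{**defaults, 'eta': eta})
    # ... record/plot cross-entropy and percent correct from stats ...

for batch_size in [10, 100, 1000]:
    stats = train(**{**defaults, 'batch_size': batch_size})
    # ... record/plot as above ...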
2.4. Model architecture [10 points]. Try 3 different values for the number of hidden units in each layer of the fully connected network (from {2, 20, 80}). You might need to adjust the learning rate and the number of epochs (iterations). Comment on the effect of this modification on the convergence properties and the generalization of the network.
2.5. Network Uncertainty [10 points]. Plot five examples where the neural network is not confident of the classification output (the top score is below some threshold), and comment on them. Will the classifier be correct if it outputs the top-scoring class anyway?
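One way to find such examples, as a minimal sketch (probs is assumed to be the (num_examples, num_classes) array of softmax outputs on the test set; the 0.5 threshold is a free choice):

import numpy as np

def low_confidence_examples(probs, threshold=0.5, k=5):
    # Indices of up to k examples whose top softmax score is below the
    # threshold, ordered from least to most confident.
    top_scores = probs.max(axis=1)
    uncertain = np.where(top_scores < threshold)[0]
    order = np.argsort(top_scores[uncertain])
    return uncertain[order][:k]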
What to submit?
2.1) Completed code.
2.2) Final training and validation errors, and a plot of these errors across all iterations. Your comments on the network's performance.
2.3) The curves you obtained in the corresponding part for the given step-size and mini-batch-size choices. Your comments/answers to the questions.
2.4) The curves you obtained in the corresponding part for the given numbers of hidden units. Your comments on convergence/generalization.
2.5) Five example images with your comments, and your answer to the question.