Computer Vision and Deep Learning
Task 1: Regression
Notation: We use index k to represent a node in the output layer, index j to represent a node in the hidden layer, and index i to represent an input unit x_i. Hence, the weight from node i in the input layer to node k in the output layer is w_{ki}. We write the activation of output unit k as ŷ_k = f(z_k), where f represents the output unit activation function (sigmoid for logistic regression or softmax for softmax regression). In equations where multiple training examples are used (for example when summing over samples), we use n to specify which training example we are referring to. Hence, y_k^n is the output of node k for training example n. If we do not specify n and simply write y_k, we implicitly refer to y_k^n. Capital letters N, I, J and K refer to the total number of nodes in a layer. For logistic regression we will not specify which output node we are referring to, as there is only a single output node k. Therefore the weight w_{k,i} can be written as w_i.
Logistic Regression
Logistic regression is a simple tool for binary classification. It can be modeled as a single neuron reading an input vector x ∈ R^{I+1} and parameterized by a weight vector w ∈ R^{I+1}. Here, I is the number of input nodes, and we add an extra constant 1 to the input for a bias parameter (this is known as the "bias trick"). The neuron outputs the probability that x is a member of class C_1. This can be written as,
P(x ∈ C_1 | x) = f(x) = \frac{1}{1 + e^{-w^T x}}    (1)
P(x ∈ C_2 | x) = 1 − P(x ∈ C_1 | x) = 1 − f(x)    (2)
where f(x) returns the probability of x being a member of class C_1; f(x) ∈ [0, 1][1]. By defining the output of our network as ŷ, we have ŷ = f(x).
We use the cross entropy loss function (Equation 3) for two categories to measure how well our hypothesis function f performs over the N data points in our dataset.
C(w) = \frac{1}{N}\sum_{n=1}^{N} C^n(w), \quad \text{where} \quad C^n(w) = -\left(y^n \ln(\hat{y}^n) + (1 - y^n)\ln(1 - \hat{y}^n)\right)    (3)
Here, y^n is the target value (also known as the label of the image). Note that we compute the average cost function, such that its magnitude does not depend on the number of training examples. Our goal is to minimize this cost function through gradient descent; the cost function reaches its minimum of 0 when y^n = ŷ^n for all n.
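For intuition, consider a single example with target y^n = 1 (the numbers here are purely illustrative): an output of ŷ^n = 0.9 gives

    C^n(w) = -\left(1 \cdot \ln(0.9) + 0 \cdot \ln(0.1)\right) \approx 0.105,

while a less confident output of ŷ^n = 0.5 gives −ln(0.5) ≈ 0.693, and the cost only approaches 0 as ŷ^n approaches 1.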
Softmax Regression
Softmax regression is simply a generalization of logistic regression to multi-class classification. Given an input x which can belong to K different classes, softmax regression outputs a vector ŷ (of length K), where each element ŷ_k represents the probability that x is a member of class k.
\hat{y}_k = \frac{e^{z_k}}{\sum_{k'=1}^{K} e^{z_{k'}}}, \quad \text{where} \quad z_k = w_k^T \cdot x = \sum_{i=1}^{I} w_{k,i} \cdot x_i    (4)
Equation 4 is known as the softmax function, and it has the property that \sum_{k=1}^{K} \hat{y}_k = 1. Note that now w is no longer a vector, but a weight matrix, w ∈ R^{K×I}.
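As a small numpy sketch of Equation 4 (illustrative only; the function name and array shapes are assumptions, not part of the starter code): subtracting the row-wise maximum before exponentiating does not change the result, since the softmax is invariant to adding a constant to all scores, but it avoids numerical overflow.

    import numpy as np

    def softmax(z):
        # z: scores z_k = w_k^T x for a batch of inputs, shape [batch size, K]
        z = z - z.max(axis=1, keepdims=True)            # shift for numerical stability
        exp_z = np.exp(z)
        return exp_z / exp_z.sum(axis=1, keepdims=True)  # each row now sums to 1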
The cross-entropy cost function for multiple classes is defined as,
C(w) = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} y_k^n \ln(\hat{y}_k^n)    (5)
For this task, please:
(a) [0.275pt] Derive the gradient for Logistic Regression. To minimize the cost function with gradient descent, we require the gradient of the cost function. Show that for Equation 3, the gradient is:
\frac{\partial C^n(w)}{\partial w_i} = -(y^n - \hat{y}^n)\, x_i^n    (6)
(The gradient of Equation 3 is then the average of this expression over the N training examples.)
Show thorough work such that your approach is clear.
Hint: To solve this, you have to use the chain rule. Also, you can use the fact that:
\frac{\partial f(x^n)}{\partial w_i} = x_i^n f(x^n)\left(1 - f(x^n)\right)    (7)
(b) [0.375pt] Derive the gradient for Softmax Regression. For the multi-class cross entropy cost in Equation 5, show that the gradient is:
\frac{\partial C^n(w)}{\partial w_{kj}} = -x_j^n\left(y_k^n - \hat{y}_k^n\right), \quad \text{where} \quad C^n(w) = -\sum_{k=1}^{K} y_k^n \ln(\hat{y}_k^n)    (8)
A few hints if you get stuck:
• Deriving the gradient of the softmax is the hardest part. Break the derivative ∂ŷ_k/∂z_{k'} down into two cases: k' = k and k' ≠ k.
Task 2: Logistic Regression through Gradient Descent
In this assignment you are going to start classifying digits in the well-known dataset MNIST. The MNIST dataset consists of 70,000 handwritten digits, split into 10 object classes (the numbers 0-9). The images are 28x28 grayscale images, and every image is perfectly labeled. The images are split into two datasets, a training set consisting of 60,000 images, and a testing set consisting of 10,000 images. For this assignment, we will use a subset of the MNIST dataset[2].
Bias trick: Each image is 28×28, so the unraveled vector will be x ∈ R^{784}. For each image, append a '1' to it, giving us x ∈ R^{785}. With this trick, we do not need to handle the bias term separately in the forward and backward pass.
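A possible one-liner for the bias trick (a sketch, assuming the images are stored as a numpy array X of shape [batch size, 784]; whether the constant 1 is placed first or last does not matter, as long as you are consistent):

    import numpy as np

    # Append a column of ones so that one weight per output acts as the bias term.
    X = np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)   # new shape: [batch size, 785]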
Logistic Regression through gradient descent
For this task, we will use mini-batch gradient descent to train a logistic regression model to predict whether an image is a 2 or a 3. We will remove all images in the MNIST dataset that are not a 2 or a 3 (this pre-processing is already implemented in the starter code). The target is 1 if the input is from the "2" category, and 0 otherwise.
Mini-batch gradient descent is a method that takes a batch of images, computes the average gradient over that batch, and then uses this gradient to update the weights. Use the gradient derived for logistic regression to classify x ∈ R^{785} into the two categories (2s and 3s).
Vectorizing code: We recommend vectorizing your code with numpy, which will make your code run significantly faster. Note that vectorizing your code is not required, but highly recommended (it will be required for assignment 2). Vectorizing simply means that, for example, you can compute the gradient in Equation 6 in one go instead of iterating over the number of examples and weights. For example, w^T x can be written as w.dot(x).
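For instance, the average of Equation 6 over a mini-batch can be computed in a single expression (a sketch; X has shape [batch size, 785], while targets and outputs have shape [batch size, 1]):

    # Vectorized gradient of the cross entropy loss (Equation 6), averaged over the batch.
    grad = -X.T.dot(targets - outputs) / X.shape[0]   # shape: [785, 1]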
For this task, please:
(a) [0.5pt] Before implementing our gradient descent training loop, we will implement a couple of essential functions. Implement four functions in the file task2a.py.
• Implement a function that pre-processes our images in the function pre_process_images. This should normalize our images from the range [0, 255] to [−1, 1][3], and it should apply the bias trick.
• Implement a function that performs the forward pass through our single layer neural network. Implement this in the function forward. This should implement the network outlined by Equation 1.
• Implement a function that performs the backward pass through our single layer neural network. Implement this in the function backward. To find the gradient for our weight, we can use the equation derived in task 1 (Equation 6).
• Implement cross entropy loss in the function cross_entropy_loss. This should compute the average of the cross entropy loss over all targets/labels and predicted outputs. The cross entropy loss is shown in Equation 3.
We have included a couple of simple tests to help you debug your code. This also includes a gradient approximation test that you should get working. For those interested, this is explained in more detail in the Appendix.
Note that you should not start on the subsequent tasks before all tests are passing!
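For reference, here is a minimal numpy sketch of the forward pass and the cross entropy loss (Equations 1 and 3). The function names match the list above, but the argument shapes ([batch size, 785] inputs, [batch size, 1] targets and outputs) are assumptions about the starter code, so adapt them as needed:

    import numpy as np

    def forward(X: np.ndarray, w: np.ndarray) -> np.ndarray:
        # Equation 1: apply the sigmoid to w^T x for every image in the batch.
        # X: [batch size, 785], w: [785, 1]  ->  returns [batch size, 1]
        return 1.0 / (1.0 + np.exp(-X.dot(w)))

    def cross_entropy_loss(targets: np.ndarray, outputs: np.ndarray) -> float:
        # Equation 3: average the binary cross entropy over the batch.
        per_example = -(targets * np.log(outputs) + (1 - targets) * np.log(1 - outputs))
        return float(per_example.mean())

    # pre_process_images follows the bias trick snippet above (plus the [0, 255] -> [-1, 1]
    # normalization), and backward follows the vectorized gradient expression above.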
(b) [0.35pt] Implement logistic regression with mini-batch gradient descent for a single layer neural network. The network should consist of a single weight matrix with 784 + 1 inputs and a single output (the matrix will have shape 785×1). Initialize the weights (before any training) to all zeros. We’ve set the default hyperparameters for you, so there is no need to change these.
During training, track the training loss for each gradient step (this is implemented in the starter code). Less frequently, track the validation loss over the whole validation set (in the starter code, this is tracked every time we progress 20% through the training set).
Implement this in the function train_step in task2.py.
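A sketch of what one gradient step could look like (illustrative only; the attribute names self.model, self.model.w, self.model.grad and self.learning_rate are assumptions about how the starter code is organized):

    def train_step(self, X_batch, Y_batch):
        # One iteration of mini-batch gradient descent.
        outputs = self.model.forward(X_batch)                  # forward pass, Equation 1
        self.model.backward(X_batch, outputs, Y_batch)         # stores the gradient from Equation 6
        self.model.w -= self.learning_rate * self.model.grad   # gradient descent update
        return cross_entropy_loss(Y_batch, outputs)            # training loss to track for this step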
(report) In your report, include a plot of the training and validation loss over training. Have the number of gradient steps on the x-axis, and the loss on the y-axis. Use the ylim function to zoom in on the graph (for us, ylim([0, 0.2]) worked fine).
(c) [0.15pt] Implement a function that computes the binary classification accuracy over a dataset, i.e. the number of correct predictions divided by the total number of predictions, where a prediction is 1 if ŷ ≥ 0.5 and 0 otherwise. Implement this in the function calculate_accuracy.
(report) Compute the accuracy on the training set and validation set over training. Plot this in a graph (similar to the loss), and include the plot in your report. Use the ylim function to zoom in on the graph (for us, ylim([0.93, 0.99]) worked fine).
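A sketch of the accuracy computation (assuming the same shapes as before and a model object exposing a forward method):

    def calculate_accuracy(X, targets, model):
        # Fraction of images where the thresholded prediction matches the target.
        outputs = model.forward(X)                        # [batch size, 1], values in (0, 1)
        predictions = (outputs >= 0.5).astype(targets.dtype)
        return float((predictions == targets).mean())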
Early Stopping: Early stopping is a tool to stop the training before your model overfits on the training dataset. By using a validation set alongside our training set[4], we can regularly check whether our model is starting to overfit. If we notice that the cost function on the validation set stops improving, we can stop the training and return the weights at the minimum validation loss.
Dataset shuffling: Shuffling the training dataset between each epoch improves convergence. With shuffling, every epoch presents the examples in a new order and in new batch combinations that the network has not seen before, which produces larger errors and improves gradient descent [5].
(d) [0.15pt] Implement early stopping into your training loop. Use the following early stop criteria: stop the training if the validation loss does not improve after passing through 20% of the training dataset 10 times. Increase the number of epochs to 500. (report) After how many epochs does early stopping kick in?
You can implement early stopping in the file trainer.py.
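One way to realize this criterion is sketched below (the variable names are assumptions; the key point is to count consecutive validation measurements without improvement and to remember the weights at the best measurement):

    # Initialize once before training:
    best_val_loss = float("inf")
    checks_without_improvement = 0

    # Run every time the validation loss is computed (i.e. every 20% of an epoch):
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_weights = model.w.copy()       # weights at the minimum validation loss so far
        checks_without_improvement = 0
    else:
        checks_without_improvement += 1
    if checks_without_improvement >= 10:    # the early stop criterion from this task
        model.w = best_weights              # return the weights at the minimum validation loss
        should_stop = True                  # signal the training loop to stop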
(e) [0.2pt] Implement dataset shuffling for your training. Before each epoch, you should shuffle all the samples in the dataset. Implement this in the function batch_loader in utils.py.
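A sketch of a shuffling batch loader (the exact signature in utils.py may differ; np.random.permutation draws a fresh random ordering of the dataset each time the loader is called, so call it once per epoch):

    import numpy as np

    def batch_loader(X, Y, batch_size, shuffle=True):
        # Yield mini-batches; with shuffle=True the order is re-drawn every call (i.e. every epoch).
        indices = np.random.permutation(X.shape[0]) if shuffle else np.arange(X.shape[0])
        for start in range(0, X.shape[0], batch_size):
            batch = indices[start:start + batch_size]
            yield X[batch], Y[batch]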
(report) Include a plot in your report of the validation accuracy with and without shuffling. You should notice that the validation accuracy has fewer "spikes". Why does this happen?
Task 3: Softmax Regression through Gradient Descent
In this task, we will perform 10-way classification on the MNIST dataset with softmax regression. Use the gradient derived for the softmax regression loss, and use mini-batch gradient descent to optimize your model.
One-hot encoding: For multi-class classification tasks, the target values must be one-hot encoded. Convert the target values from integers to one-hot vectors (e.g. 3 → [0,0,0,1,0,0,0,0,0,0]). The length of the vector should be equal to the number of classes (K = 10 for MNIST, one class per digit).
For this task, please:
(a) [0.55pt] Before implementing our gradient descent training loop, we will implement a couple of essential functions. Implement four functions in the file task3a.py.
• Implement a function that one-hot encodes our labels in the function one_hot_encode. This should return a new vector with one-hot encoded labels.
• Implement a function that performs the forward pass through our single layer softmax model. Implement this in the function forward. This should implement the network outlined by Equation 4.
• Implement a function that performs the backward pass through our single layer neural network. Implement this in the function backward. To find the gradient for our weight, use Equation 8.
• Implement cross entropy loss in the function cross_entropy_loss. This should compute the average of the cross entropy loss over all targets/labels and predicted outputs. The cross entropy loss is defined in Equation 5.
We have included a couple of simple tests to help you debug your code.
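A minimal sketch of the four functions listed above (shapes assumed: X is [batch size, 785], the weight matrix w is [785, 10], and targets are one-hot with shape [batch size, 10]; check the actual signatures in the starter code):

    import numpy as np

    def one_hot_encode(Y, num_classes):
        # Y: [batch size, 1] integer labels  ->  [batch size, num_classes] one-hot matrix
        encoded = np.zeros((Y.shape[0], num_classes))
        encoded[np.arange(Y.shape[0]), Y.flatten().astype(int)] = 1
        return encoded

    def forward(X, w):
        # Equation 4: softmax over the K class scores for every image.
        z = X.dot(w)                                     # [batch size, 10]
        z = z - z.max(axis=1, keepdims=True)             # numerical stability
        exp_z = np.exp(z)
        return exp_z / exp_z.sum(axis=1, keepdims=True)

    def backward(X, outputs, targets):
        # Equation 8 averaged over the batch: -X^T (targets - outputs) / batch size
        return -X.T.dot(targets - outputs) / X.shape[0]  # [785, 10]

    def cross_entropy_loss(targets, outputs):
        # Equation 5: only the correct class contributes, since the targets are one-hot.
        return float(-np.mean(np.sum(targets * np.log(outputs), axis=1)))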
(b) [0.1pt] The rest of the task 3 subtasks should be implemented in task3.py.
Implement softmax regression with mini-batch gradient descent for a single layer neural network. The network should consist of a single weight matrix, with 784 + 1 inputs and ten outputs (shape 785 × 10). Initialize the weight (before any training) to all zeros.
Implement this in train_step in task3.py. This function should be identical to the one in Task 2b, except that you are using a different cross entropy loss function.
(report) In your report, include a plot of the training and validation loss over training. Have the number of gradient steps on the x-axis, and the loss on the y-axis. Use the ylim function to zoom in on the graph (for us, ylim([0.2, .6]) worked fine).
(c) [0.15pt] Implement a function that computes the multi-class classification accuracy over a dataset. Implement this in the function calculate_accuracy.
(report) Include in your report a plot of the training and validation accuracy over training.
(d) [0.15pt] (report) For your model trained in task 3c, do you notice any signs of overfitting? Explain your reasoning.
Task 4: Regularization
One way to improve generalization is to use regularization. Regularization is a modification we make to a learning algorithm that is intended to reduce its generalization error [6]. The idea is that we should penalize the model for being too complex. In this assignment, we will carry this out by introducing a new term in our objective function that makes the model "smaller" by penalizing large weights.
J(w) = C(w) + λR(w), (9)
where R(w) is the complexity penalty and λ is the strength of regularization (constant). There are several forms for R, such as L2 regularization
R(w) = \|w\|^2 = \sum_{i,j} w_{i,j}^2,    (10)
where w is the weight vector of our model.
For your report, please:
(a) [0.15pt] (report) Derive the update term for softmax regression with L2 regularization, that is, find \frac{\partial J}{\partial w}, where C is given by Equation 5.
(b) [0.3pt] Implement L2 regularization in your backward pass. You can implement the regularization in backward in task3a.py.
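A sketch of how the penalty typically enters the backward pass, assuming the form of R(w) in Equation 10 so that ∂R/∂w = 2w (if your derivation in task 4a uses R(w) = ½||w||², drop the factor 2). The variable name l2_reg_lambda is an assumption:

    # Inside backward(), after computing the cross entropy gradient (Equation 8):
    grad = -X.T.dot(targets - outputs) / X.shape[0]
    grad = grad + l2_reg_lambda * 2 * w      # dJ/dw = dC/dw + lambda * dR/dw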
For the remaining part of the assignment, you can implement the functionality in the file task3.py or create a new file.
(report) Train two different models with different λ values for the L2 regularization: use λ = 0.0 and λ = 2.0. Visualize the weights for each digit for the two models. Why are the weights for the model with λ = 2.0 less noisy?
The visualization should be similar to Figure 1.
Figure 1: The visualization of the weights for a model with λ = 0.0 (top row), and λ = 2.0 (bottom row).
(c) [0.2pt] (report) Train your model with different values of λ: 2.0, 0.2, 0.02, 0.002. Note that for each value of λ, you should train a new network from scratch. Plot the validation accuracy for different values of λ on the same graph (during training). Have the accuracy on the y-axis, and number of training steps on the x-axis.
(d) [0.2pt] (report) You will notice that the validation accuracy degrades when applying any amount of regularization. What do you think is the reason for this?
(e) [0.2pt] (report) Plot the length (L2 norm, ||w||_2) of the weight vector for each λ value in task 4c. Plot the λ value on the x-axis and the L2 norm on the y-axis. What do you observe?
Note that you should plot the L2 norm of the weight after each network is finished training.
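A possible way to produce this plot (a sketch; it assumes you have collected the trained weight matrix of each run in a list called trained_weights, ordered like the list of λ values):

    import numpy as np
    import matplotlib.pyplot as plt

    l2_lambdas = [2.0, 0.2, 0.02, 0.002]
    norms = [np.linalg.norm(w) for w in trained_weights]  # L2 (Frobenius) norm per trained model
    plt.plot(l2_lambdas, norms, "o-")
    plt.xscale("log")                  # the lambda values span several orders of magnitude
    plt.xlabel("$\\lambda$")
    plt.ylabel("$||w||_2$")
    plt.show()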
Appendix
Gradient Approximation test
When implementing neural networks from the bottom up, it is easy to introduce minor bugs that completely destroy the training process. Gradient approximation is a method to obtain a numerical approximation of what the gradient should be, and it is extremely useful when debugging your forward, backward, and cost functions. If the test fails, it indicates that there is a bug in one (or more) of these functions.
It is possible to compute the gradient with respect to one weight by using numerical approximation:
\frac{\partial C(w)}{\partial w_{ij}} \approx \frac{C(w_{ij} + \epsilon) - C(w_{ij} - \epsilon)}{2\epsilon},    (11)
where ϵ is a small constant (e.g. 10^{-2}), and C(w_{ij} + ϵ) refers to the cost on example x^n when the weight w_{ij} is set to w_{ij} + ϵ. The difference between the gradients should be within big-O of ϵ^2, so if ϵ = 10^{-2}, your gradients should agree within O(10^{-4}).
If your gradient approximation does not agree with your calculated gradient from backpropagation, there is something wrong with your code!
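A sketch of such a test, using the forward, backward, and cross_entropy_loss functions sketched earlier (the tolerance ϵ² follows the big-O argument above; this is illustrative, not the exact test shipped with the starter code):

    import numpy as np

    def gradient_approximation_test(w, X, targets, epsilon=1e-2):
        # Compare the analytical gradient against the numerical estimate of Equation 11
        # for every single weight.
        analytical = backward(X, forward(X, w), targets)
        for i in range(w.shape[0]):
            for j in range(w.shape[1]):
                original = w[i, j]
                w[i, j] = original + epsilon
                cost_plus = cross_entropy_loss(targets, forward(X, w))
                w[i, j] = original - epsilon
                cost_minus = cross_entropy_loss(targets, forward(X, w))
                w[i, j] = original                      # restore the weight
                numerical = (cost_plus - cost_minus) / (2 * epsilon)
                assert abs(numerical - analytical[i, j]) <= epsilon ** 2, (
                    f"Mismatch for weight ({i}, {j}): "
                    f"numerical={numerical:.6f}, analytical={analytical[i, j]:.6f}")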
[1] The function f is known as the sigmoid activation function
[2] We will only use the first 20,000 images in the training set to reduce computation time.
[3] Normalizing the input to be zero-centered improves convergence for neural networks. Read more about this in LeCun et al., Efficient BackProp, Section 4.3.
[4] Note that we never train on the validation set.
[5] For those interested, you can read more about dataset shuffling in Section 4.2 in Efficient Backprop.
[6] The generalization error can be thought of as the difference between the validation error and the training error.