In this assignment you will train and test k-layer networks with multiple outputs to classify images (once again) from the CIFAR-10 dataset. You will upgrade your code from Assignment 2 in two significant ways:
1. Generalize your code so that you can train and test k-layer networks.
2. Incorporate batch normalization into the k-layer network both for training and testing.
The overall structure of your code for this assignment should mimic that from Assignment 2. You will mainly just have to modify the functions that implement the forward and backward passes. As in Assignment 2 you will train your network with mini-batch gradient descent and cyclical learning rates. Before the explicit instructions for the assignment, we present the mathematical details that you will need to complete it. As in the previous assignment we will train our networks with mini-batch gradient descent by minimizing a cost function: a weighted sum of the cross-entropy loss on the labelled training data and an L2 regularization of the weight matrices (see equation (20) for the general form).
Background 1: k-layer network
The mathematical details of the first network you will implement are as follows. Given an input vector, x, of size d×1, our classifier outputs a vector of probabilities, p (K × 1), over the possible output labels. Set x(0) = x and then for l = 1,2,...,k − 1
s(l) = Wl x(l−1) + bl          (1)
x(l) = max(0, s(l))            (2)
and then finally
s = Wk x(k−1) + bk             (3)
p = SOFTMAX(s)                 (4)
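To make the recipe concrete, here is a minimal Matlab sketch of this forward pass. It assumes the parameters are kept in cell arrays W and b (a layout chosen for this example, not a requirement of the assignment):

function [P, Xs] = ForwardPass(X, W, b)
% Forward pass of equations (1)-(4).
% X    : d x n matrix of input vectors (one column per example)
% W, b : cell arrays of length k holding the layer parameters
% P    : K x n matrix of class probabilities
% Xs   : cell array with the intermediary activations x(l), l = 1,...,k-1
k = numel(W);
n = size(X, 2);
Xs = cell(1, k-1);
x = X;
for l = 1:k-1
    s = W{l} * x + b{l} * ones(1, n);    % equation (1)
    x = max(0, s);                       % equation (2), ReLU
    Xs{l} = x;
end
s = W{k} * x + b{k} * ones(1, n);        % equation (3)
P = exp(s) ./ (ones(size(s, 1), 1) * sum(exp(s), 1));   % equation (4), softmax
end

Keeping the x(l)'s in a cell array makes them easy to reuse in the backward pass.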
The equations for the gradient computations of the back-propagation algorithm for a k-layer network are given in Lecture 4. I suggest you implement the efficient version of the backward pass as it will make your computations much faster.
Background 2: k-layer network with Batch Normalization
You will discover that training a network becomes tricky as its number of layers increases. A proper initialization of the weights is key, and ηmax in the cyclic learning rate approach may have to be kept relatively small, which makes training slow. The second part of the assignment is therefore devoted to implementing batch normalization to overcome these limitations and also to get a feel for the effect of this process on training. We now give the explicit mathematical details of batch normalization for a k-layer network.
At test time it is assumed that you have a pre-computed
• µ(l) - an estimated mean for the unnormalized scores s(l) at layer l (has the same size as s(l)),
• v(l) - the vector containing the estimated variance for each dimension of s(l).
It is also assumed that you have learnt during training extra parameters γ1,...,γk−1 and β1,...,βk−1 to scale and shift each entry of the normalized activations at each layer. Batch normalization, followed by a scale and shift, is then implemented at test time with these equations: for l = 1,2,...,k − 1
s(l) = Wl x(l−1) + bl                                       (5)
ŝ(l) = BatchNormalize(s(l), µ(l), v(l))                     (6)
s̃(l) = γl ⊙ ŝ(l) + βl                                       (7)
x(l) = max(0, s̃(l))                                         (8)
and then finally
s = Wk x(k−1) + bk                                          (9)
p = SOFTMAX(s)                                              (10)
where
BatchNormalize(s(l), µ(l), v(l)) = diag(v(l) + ε)^(−1/2) (s(l) − µ(l))      (11)
and ε is a small number, of the order of magnitude of Matlab's eps constant, to ensure you don't divide by 0.
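A possible Matlab implementation of equation (11), assuming the scores are stored column-wise (one column per example) and using Matlab's built-in eps as the small constant:

function s_hat = BatchNormalize(s, mu, v)
% Implements equation (11).
% s  : m x n matrix of un-normalized scores (one column per example)
% mu : m x 1 vector of means, v : m x 1 vector of variances
n = size(s, 2);
s_hat = diag((v + eps).^(-0.5)) * (s - mu * ones(1, n));
end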
Forward pass of BN for back-propagation training
During the forward pass of BN training for each mini-batch you also normalize the scores at each layer, but you compute the mean and variances of the un-normalized scores from the data in the mini-batch. In more detail assume that we have a mini-batch of data B = {(x1,y1),...,(xn,yn)}. At each layer 1 ≤ l ≤ k − 1 you must make the following computations. Compute the un-normalized scores at the current layer l for each example in the mini-batch
si(l) = Wl xi(l−1) + bl    for i = 1,...,n                      (12)
Then compute the mean and variances of these un-normalized scores
µ(l) = (1/n) Σi si(l)                                           (13)
vj(l) = (1/n) Σi (sij(l) − µj(l))²    for j = 1,...,ml           (14)
where ml is the dimension of the scores at layer l and the sums run over i = 1,...,n. Given the computed mean and variances we can now normalize the scores for the mini-batch and subsequently apply ReLU. So for i = 1,...,n:
ŝi(l) = BatchNormalize(si(l), µ(l), v(l))                        (15)
s̃i(l) = γl ⊙ ŝi(l) + βl                                          (16)
xi(l) = max(0, s̃i(l))                                            (17)
The final layer is then applied as usual, for i = 1,...,n:
si = Wk xi(k−1) + bk                                             (18)
pi = SOFTMAX(si)                                                 (19)
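A hedged sketch of how equations (12)–(19) could be assembled into a training-time forward pass, reusing the BatchNormalize helper sketched above (the function name, argument order and returned cell arrays are this example's choices):

function [P, Xs, S, S_hat, mus, vars_] = ForwardPassBN(X, W, b, gammas, betas)
% Training-time forward pass with batch normalization, equations (12)-(19).
% The batch means and variances are computed from the mini-batch itself
% and returned so the backward pass (and the moving averages) can use them.
k = numel(W);
n = size(X, 2);
Xs = cell(1, k-1);  S = cell(1, k-1);  S_hat = cell(1, k-1);
mus = cell(1, k-1); vars_ = cell(1, k-1);
x = X;
for l = 1:k-1
    s = W{l} * x + b{l} * ones(1, n);                          % equation (12)
    mu = mean(s, 2);                                           % equation (13)
    v = var(s, 0, 2) * (n - 1) / n;                            % equation (14), divide by n
    s_hat = BatchNormalize(s, mu, v);                          % equation (15)
    s_tilde = (gammas{l} * ones(1, n)) .* s_hat + betas{l} * ones(1, n);   % equation (16)
    x = max(0, s_tilde);                                       % equation (17)
    S{l} = s;  S_hat{l} = s_hat;  mus{l} = mu;  vars_{l} = v;  Xs{l} = x;
end
s = W{k} * x + b{k} * ones(1, n);                              % equation (18)
P = exp(s) ./ (ones(size(s, 1), 1) * sum(exp(s), 1));          % equation (19)
end

The returned S, S_hat, mus and vars_ are exactly the quantities the backward pass described next needs.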
Backward pass of BN for back-propagation training
As we have applied score normalization, plus a scaling and a shifting, during the forward pass, we have to compensate for these in the backward pass of the back-propagation algorithm. As per usual let J represent the cost function for the mini-batch, that is
J(B, λ) = (1/|B|) Σ(x,y)∈B lcross(x, y) + λ Σl ||Wl||²          (20)
From the forward pass of the back-prop algorithm you should store the matrices
Xbatch(l) = [x1(l), ..., xn(l)],   Sbatch(l) = [s1(l), ..., sn(l)],   Ŝbatch(l) = [ŝ1(l), ..., ŝn(l)]
and µ(l), v(l) for the intermediary layers l = 1,...,(k−1), and then also the final probability vectors output for each example in the batch:
Pbatch = [p1, p2, ..., pn]
Given these quantities it is then possible to compute the gradient for all the parameters that have to be learnt in the network:
• Propagate the gradient through the loss and softmax operations
Gbatch = −(Ybatch − Pbatch)                                        (21)
• The gradients of J w.r.t. bias vector bk and Wk
∂J/∂Wk = (1/n) Gbatch Xbatch(k−1)ᵀ + 2λWk,    ∂J/∂bk = (1/n) Gbatch 1n      (22)
• Propagate Gbatch to the previous layer
Gbatch = Wkᵀ Gbatch                                                (23)
Gbatch = Gbatch ⊙ Ind(Xbatch(k−1) > 0)                             (24)
• For l = k − 1, k − 2, ..., 1
1. Compute the gradient for the scale and offset parameters for layer l:
∂J/∂γl = (1/n) (Gbatch ⊙ Ŝbatch(l)) 1n,    ∂J/∂βl = (1/n) Gbatch 1n         (25)
2. Propagate the gradients through the scale and shift
Gbatch = Gbatch ⊙ (γl 1nᵀ)                                         (26)
3. Propagate Gbatch through the batch normalization
Gbatch = BatchNormBackPass(Gbatch, Sbatch(l), µ(l), v(l))          (27)
4. The gradients of J w.r.t. bias vector bl and Wl
∂J/∂Wl = (1/n) Gbatch Xbatch(l−1)ᵀ + 2λWl,    ∂J/∂bl = (1/n) Gbatch 1n      (28)
5. If l > 1 propagate Gbatch to the previous layer
Gbatch = Wlᵀ Gbatch                                                (29)
Gbatch = Gbatch ⊙ Ind(Xbatch(l−1) > 0)                             (30)
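As a companion to equations (21)–(30), here is a sketch of the backward pass in Matlab. It assumes the quantities returned by the ForwardPassBN sketch above and a BatchNormBackPass helper implementing equations (31)–(37) given below; the interface is just one possible choice:

function [grad_W, grad_b, grad_gammas, grad_betas] = BackwardPassBN( ...
        X, Y, P, Xs, S, S_hat, mus, vars_, W, gammas, lambda)
% Backward pass for a k-layer network with batch normalization,
% following equations (21)-(30).  X is the mini-batch input, Y the
% one-hot labels, and the remaining arguments were stored during the
% forward pass (see the ForwardPassBN sketch above).
k = numel(W);
n = size(X, 2);
grad_W = cell(1, k);  grad_b = cell(1, k);
grad_gammas = cell(1, k-1);  grad_betas = cell(1, k-1);

G = -(Y - P);                                             % equation (21)
grad_W{k} = (1/n) * G * Xs{k-1}' + 2 * lambda * W{k};     % equation (22)
grad_b{k} = (1/n) * G * ones(n, 1);
G = W{k}' * G;                                            % equation (23)
G = G .* (Xs{k-1} > 0);                                   % equation (24)

for l = k-1:-1:1
    grad_gammas{l} = (1/n) * (G .* S_hat{l}) * ones(n, 1);    % equation (25)
    grad_betas{l}  = (1/n) * G * ones(n, 1);
    G = G .* (gammas{l} * ones(1, n));                        % equation (26)
    G = BatchNormBackPass(G, S{l}, mus{l}, vars_{l});         % equation (27)
    if l == 1
        Xin = X;            % the layer below is the input itself
    else
        Xin = Xs{l-1};
    end
    grad_W{l} = (1/n) * G * Xin' + 2 * lambda * W{l};         % equation (28)
    grad_b{l} = (1/n) * G * ones(n, 1);
    if l > 1
        G = W{l}' * G;                                        % equation (29)
        G = G .* (Xs{l-1} > 0);                               % equation (30)
    end
end
end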
where the function BatchNormBackPass(Gbatch, Sbatch(l), µ(l), v(l)) corresponds to the following steps:
σ1 = (v(l) + ε)^(−0.5)                                             (31)
σ2 = (v(l) + ε)^(−1.5)                                             (32)
G1 = Gbatch ⊙ (σ1 1nᵀ)                                             (33)
G2 = Gbatch ⊙ (σ2 1nᵀ)                                             (34)
D = Sbatch(l) − µ(l) 1nᵀ                                            (35)
c = (G2 ⊙ D) 1n                                                    (36)
Gbatch = G1 − (1/n) (G1 1n) 1nᵀ − (1/n) D ⊙ (c 1nᵀ)                (37)
assuming the exponentiation of the vector v(l) + ε is applied element-wise. Remember that the ith column of Gbatch sent into BatchNormBackPass represents ∂J/∂ŝi(l) while the ith column of Gbatch returned by BatchNormBackPass represents ∂J/∂si(l).
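A corresponding Matlab sketch of BatchNormBackPass (one possible implementation of equations (31)–(37), with the element-wise conventions stated above):

function G = BatchNormBackPass(G, S, mu, v)
% Back-propagates G through the batch-normalization operation,
% following equations (31)-(37).
% G     : m x n, column i is dJ/d s_hat_i on input and dJ/d s_i on output
% S     : m x n un-normalized scores for the mini-batch
% mu, v : m x 1 batch mean and variance
n = size(S, 2);
sigma1 = (v + eps).^(-0.5);                       % equation (31)
sigma2 = (v + eps).^(-1.5);                       % equation (32)
G1 = G .* (sigma1 * ones(1, n));                  % equation (33)
G2 = G .* (sigma2 * ones(1, n));                  % equation (34)
D  = S - mu * ones(1, n);                         % equation (35)
c  = (G2 .* D) * ones(n, 1);                      % equation (36)
G  = G1 - (1/n) * (G1 * ones(n, 1)) * ones(1, n) ...
       - (1/n) * D .* (c * ones(1, n));           % equation (37)
end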
You should note that the network’s bias parameters bl for l = 1,...,k − 1 are superfluous when using batch normalization as you will subtract away these biases when you normalize. These bias parameters will be estimated as effectively zero vectors when you train.
Exponential moving average for batch means and variances
When you train your network with batch normalization you should keep an exponential moving average estimate of the mean and variances for the un-normalized scores for each layer that will be used during test time. You can achieve this, after each forward pass of the mini-batch gradient descent algorithm (which generates a new µ(l) and v(l)), by setting:
for l = 1,...,k − 1
µav(l) = α µav(l) + (1 − α) µ(l)                                   (38)
vav(l) = α vav(l) + (1 − α) v(l)                                   (39)
where α ∈ (0,1) and typically α ≈ .9 (the training in this assignment is shorter than usual, which calls for a smaller value of α). You can initialize µav(l) to be equal to the µ(l) obtained from the very first mini-batch update step and similarly for vav(l).
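In code, this update can be a small helper called after each mini-batch update step; a sketch, assuming the means and variances are kept in cell arrays (the names are this example's choice):

function [mu_av, v_av] = UpdateMovingAverages(mu_av, v_av, mus, vars_, alpha)
% Exponential moving average of the layer means and variances,
% equations (38)-(39).  On the very first update step simply set
% mu_av = mus and v_av = vars_, as described in the text above.
for l = 1:numel(mus)
    mu_av{l} = alpha * mu_av{l} + (1 - alpha) * mus{l};    % equation (38)
    v_av{l}  = alpha * v_av{l}  + (1 - alpha) * vars_{l};  % equation (39)
end
end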
Exercise 1: Upgrade Assignment 2 code to train & test k-layer networks
In Assignment 2 you wrote code to train and test a 2-layer neural network. For the first part of this assignment you should upgrade your code from Assignment 2 so that you can train and test a k-layer network. If you have a decent architecture for your code, this should not involve too much coding. You will need to refine the functions and data structures that you use
1. to store and initialize the parameters of your network (see the sketch after this list),
2. to apply the network to input vectors and keep a record of the intermediary activations when you apply the network (the x(l)'s in equation (2)) (forward pass),
3. to compute the gradient of the cost function for a mini-batch relative to the parameters of the network using the gradient equations in the lecture notes (backward pass).
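For point 1 in this list, one convenient option is to keep the parameters in cell arrays indexed by layer. A minimal He-initialization sketch (the struct layout and function name are this example's choices, not requirements):

function NetParams = InitParams(layer_sizes)
% He initialization for a k-layer network.
% layer_sizes = [d, m1, ..., m(k-1), K], e.g. [3072, 50, 50, 10] for the
% 3-layer network on CIFAR-10.
k = numel(layer_sizes) - 1;
NetParams.W = cell(1, k);
NetParams.b = cell(1, k);
for l = 1:k
    fan_in = layer_sizes(l);
    NetParams.W{l} = randn(layer_sizes(l+1), fan_in) * sqrt(2 / fan_in);   % He initialization
    NetParams.b{l} = zeros(layer_sizes(l+1), 1);
end
end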
When you have upgraded your code you should debug the gradient computations and check them numerically, as in the previous assignments. Please only do numerical checks on networks with a small number of nodes in each layer and a much reduced input dimensionality (d ≈ 10) to avoid numerical precision issues. You should start with a 2-layer network, then a 3-layer network and then finally a 4-layer network. You'll probably notice that the discrepancy between the analytic and the numerical gradients increases for the earlier layers as the gradient is back-propagated through the network. You can re-read the relevant section of the additional material for Lecture 3 from Stanford's course Convolutional Neural Networks for Visual Recognition to get all the tips and potential issues. But remember to make your checks initially with lambda=0. Once you have convinced yourself that your analytic gradient computations are bug free you should continue with the assignment.
Exercise 2: Can I train multi-layer networks?
First check, with the new version of your code, that you can replicate the (default) results you achieved in Assignment 2 with a 2-layer network with 50 nodes in the hidden layer using mini-batch gradient descent with a cyclic learning rate. If the answer is yes, then your next task is to train a 3-layer network with 50 and 50 nodes in the first and second hidden layers respectively with the same learning parameters. You can use the following hyper-parameter settings: n_batch=100, eta_min = 1e-5, eta_max = 1e-1, lambda=.005, two cycles of training and n_s = 5 * 45,000 / n_batch. You should use a careful initialization such as Xavier or He initialization. With these settings my trained network after two cycles got a test accuracy of ∼52%. I also randomly shuffle the order of the training data after each epoch. Now consider a 9-layer network whose numbers of nodes at the hidden layers are [50, 30, 20, 20, 10, 10, 10, 10]. This network has roughly the same number of weight parameters as the earlier network. Train the network with the same hyper-parameter settings as before and see what happens to performance.
For the deeper network its performance dropped by quite a bit. Generally as a network becomes deeper it becomes harder to train when training with variants of mini-batch gradient descent and using a more standard decay of the learning rate. The technique of batch normalization is a way to overcome this difficulty.
Exercise 3: Implement batch normalization
You have seen that training networks with many layers can be tricky. In the lecture notes I told you batch normalization overcomes this problem. Now it’s your turn to implement it.
First, consider the forward pass where you apply the network to the input data in a mini-batch. For the first part of this assignment you will have written a function that evaluates the network on a mini-batch of input data and returns the probability scores and the intermediary activations (for each hidden layer) for each example in the mini-batch. For batch normalization you will need to augment your code so that it implements equations (12)–(19) (and returns the intermediary vectors needed by the backward pass). In the first version of your new function you should write it assuming the layer means and variances are computed from the mini-batch of data sent into the function. You will, however, also call this function at test time and in this case it is assumed that the un-normalized scores are normalized by known pre-computed means and variances that have been estimated during training. Thus you should write a final version of the function so that it can take a variable number of inputs depending on whether you send it pre-computed means and variances or not. You can do this in Matlab using the varargin cell structure. Use the help command to get more details.
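One way to get this variable-input behaviour with varargin is sketched below for the normalization of a single layer's scores; the function name and argument convention are assumptions of this example, and your forward-pass function would call it once per hidden layer:

function [s_hat, mu, v] = NormalizeScores(s, varargin)
% Normalizes the scores s (m x n).  If a pre-computed mean and variance
% are passed as two extra arguments they are used (test time);
% otherwise they are estimated from the mini-batch (training time).
n = size(s, 2);
if numel(varargin) == 2
    mu = varargin{1};
    v  = varargin{2};
else
    mu = mean(s, 2);                        % equation (13)
    v  = var(s, 0, 2) * (n - 1) / n;        % equation (14), divide by n
end
s_hat = diag((v + eps).^(-0.5)) * (s - mu * ones(1, n));   % equation (11)
end

At training time you would call NormalizeScores(s), and at test time NormalizeScores(s, mu_av{l}, v_av{l}) with the moving averages from equations (38)–(39).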
Note: If you store your un-normalized scores for a batch at the lth layer in the matrix scores (this would correspond to Sbatch(l) in the mathematical description) of size ml × n, where n is the number of examples in the mini-batch, then this Matlab code will compute the variance for each dimension: var_scores = var(scores, 0, 2);
The Matlab function var computes the variances by dividing the relevant sum-of-squares quantities by n−1; however, in the original batch normalization paper it is assumed that the variance is computed by dividing by n instead. The back-propagation equations in the lecture slides assume the latter, therefore you will have to compensate for this fact by applying:
var_scores = var_scores * (n-1) / n;
Next up is the implementation of the backward pass. You should upgrade the functions from the first part of the assignment to implement equations (21)–(30). Once you have completed this it is time to check your analytic gradient computations as per usual. Just a couple of tips:
• When you compute the loss in the numerical calculation of the gradient you have to apply the network function to the mini-batch data. When you do this you have to apply batch normalization and you should, as in your analytic gradient computations, compute the means and variances of the un-normalized scores from the mini-batch data.
• Make sure your mini-batch has a size larger than 1. You want to make sure your mean and variance computations are okay.
You should check with a 2-layer network (with 50 hidden nodes) and then a 3-layer network (with 50 and 50 hidden nodes respectively). After you have convinced yourself that your gradient computations are okay then you should move on to training your network. (The numerical gradient computations from Assignment 2 have to be augmented with the numerical gradient computations w.r.t. the parameters γl and βl.)
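For the extra numerical checks mentioned in parentheses above, here is a hedged sketch of a centered-difference gradient w.r.t. γl. It assumes the scale parameters live in a cell array NetParams.gammas and that a helper ComputeCost (a hypothetical name for this example) runs the BN forward pass on the mini-batch, computing the batch means and variances internally, and returns the cost of equation (20):

function grad_gamma = NumericalGradGamma(X, Y, NetParams, l, lambda, h)
% Centered-difference estimate of dJ/d gamma_l, to compare against the
% analytic gradient from equation (25).  Assumes ComputeCost(X, Y,
% NetParams, lambda) evaluates the cost (20) with batch statistics
% computed from X.
grad_gamma = zeros(size(NetParams.gammas{l}));
for j = 1:numel(NetParams.gammas{l})
    TryParams = NetParams;
    TryParams.gammas{l}(j) = NetParams.gammas{l}(j) - h;
    c1 = ComputeCost(X, Y, TryParams, lambda);
    TryParams.gammas{l}(j) = NetParams.gammas{l}(j) + h;
    c2 = ComputeCost(X, Y, TryParams, lambda);
    grad_gamma(j) = (c2 - c1) / (2 * h);
end
end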
There is just one upgrade you need to make in the top level function implementing the mini-batch gradient descent learning algorithm. You need to keep an exponential moving average of the batch mean and variances for the un-normalized scores for each layer of your network as defined by equations (38) and (39). You should use these moving averages when you compute the cost and accuracy on the training and validation sets after each epoch.
You should train a 3-layer network with 50 and 50 nodes in the first and second hidden layers respectively. You should train as in Assignment 2 with a cyclic learning rate. I achieved quite good results (when using 45,000 training examples) with He initialization and hyper-parameter settings of eta_min = 1e-5, eta_max = 1e-1, lambda=.005, two cycles of training and n_s = 5 * 45,000 / n_batch. With this set-up I was able to achieve test accuracies of ∼53.5%. (To reach this level of accuracy when using BN it seems to be important that you shuffle the order of your training samples after each epoch. I think the reason for this is that it ensures you have different combinations of training examples in your batches over the epochs, and this is good both for regularization and for estimating the mean and standard deviation of the activations at each layer.) You should then perform a coarse-to-fine search to find a good value for lambda. After you have found a good setting for lambda, you should train a network for 3 cycles and see what test accuracy this network can achieve.
Now reconsider the 9-layer network whose number of nodes at the hidden layers are [50, 30, 20, 20, 10, 10, 10, 10] respectively. Train the network with the same hyper-parameter settings as your 3-layer network and see what happens to performance. Not bad! Hopefully this result will convince you that batch normalization is a good thing!
The frequently stated pros of batch normalization are that training becomes more stable, that higher learning rates can be used than when batch normalization is not used, and that it acts as a form of regularization. I would like you to explore whether you can get some experimental evidence that is consistent with one of these stated pros. You will train your 3-layer network with 50 nodes at each hidden layer, and the basic hyper-parameter settings will be eta_min = 1e-5, eta_max = 1e-1, lambda=.005, two cycles of training with n_s = 5 * 45,000 / n_batch.
Sensitivity to initialization. For each training run, instead of using He initialization, initialize each weight parameter to be normally distributed with the same standard deviation sig at each layer. For three runs set sig=1e-1, 1e-3 and 1e-4 respectively, and train the network with and without BN to see the effect on the final test accuracy. Use n_s = 2 * 45,000 / n_batch if training is slow on your machine and you want to complete training with fewer update steps.
To complete the assignment:
To pass the assignment you need to upload to bilda:
1. The code for your k-layer network trained and tested with batch normalization assembled into one file.
2. A brief pdf report with the following content:
i) State how you checked your analytic gradient computations and whether you think that your gradient computations are bug free for your k-layer network with batch normalization.
ii) Include graphs of the evolution of the loss function when you train the 3-layer network with and without batch normalization with the given default parameter setting.
iii) Include graphs of the evolution of the loss function when you train the 9-layer network with and without batch normalization with the given default parameter setting.
iv) State the range of the values you searched for lambda when you tried to optimize the performance of the 3-layer network trained with batch normalization, and the lambda settings for your best performing 3-layer network. Also state the test accuracy achieved by this network.
v) Include the loss plots for training with batch normalization vs. without batch normalization for the experiment related to Sensitivity to initialization, and comment on your experimental findings.
Exercise 4: Optional for bonus points
1. Optimize the performance of the network
It would be interesting to discover the best performance achievable by a k-layer fully connected network on CIFAR-10. From a quick search of the web it seems the best reported performance of a fully connected network on CIFAR-10 is 78%. The details of this network are available in How far can we go without convolution: Improving fully connected networks by Lin, Memisevic and Konda.
Here are some tricks/avenues you can explore to help bump up performance:
(a) Do a more exhaustive random search to find good values for the amount of regularization.
(b) Do a more thorough search to find a good network architecture. Does making the network deeper improve performance?
(c) It has been empirically reported in several works that you get better performance by the final network if you apply batch normalization to the scores after the non-linear activation function has been applied. You could investigate whether this is the case. You will have to update your forward and backward pass of the back-prop algorithm accordingly.
(d) Apply dropout to your training if you have a high number of hidden nodes and you feel you need more regularization.
(e) Augment your training data by applying small random geometric and photometric jitter to the original training data. You can do this on the fly by applying a random jitter to each image in the mini-batch before doing the forward and backward pass.
Bonus Points Available: 1 bonus point for each non-trivial improvement (capped at 4 bonus points, you can follow my suggestions, think of your own or some combination of the two.)
To get the bonus point you must submit
(a) Your code.
(b) Pdf document reporting on your trained network with the best test accuracy, what improvements you made and which ones brought the largest gains (if any!).