$30
This homework is to familiarize yourself with gradient backpropagation in neural networks. You will implement and experiment with a neural network known as MLP for mutliclass classification. As a starting point, use the section on neural networks in the class materials.
1 THEORETICAL PART a : derivatives and relationships between basic functions
Given
— logistic sigmoid sigmoid( .
— hyperbolic tangent .
— softplus softplus(x) = ln(1 + exp(x))
— function sign which returns +1 if its argument is positive, -1 if negative and 0 if 0.
— 1S(x) is the indicator function which returns 1 if x ∈ S (or x respects condition S), otherwise returns 0.
— rectifier function which keeps only the positive part of its argument : rect(x) returns x if x ≥ 0 and returns 0 if x < 0. It is also named RELU (rectified linear unit) : rect(x) = RELU(x) = [x]+ = max(0,x) = 1{x0}(x)
1. Show that sigmoid(
2. Show that lnsigmoid(x) = −softplus(−x)
3. Show that the derivative of the sigmoid is : sigmoid0(x) = dsigmoiddx (x) = sigmoid(x)(1 − sigmoid(x))
4. Show that the tanh derivative is : tanh0(x) = 1 − tanh2(x)
5. Write the sign function using only indicator functions : sign(x) = ...
6. Write the derivative of the absolute function abs(x) = |x|. Note : its derivative at 0 is not defined, but your function abs0 can return 0 at 0. Note 2 : use the sign function : abs0(x) = ...
7. Write the derivative of the function rect. Note : its derivative at 0 is undefined, but your function can return 0 at 0. Note2 : use the indicator function. rect0(x) = ...
8. Let the squared L2 norm of a vector be : . Write the vector of the gradient :
9. Let the norm L1 of a vector be : kxk1 = Pi |xi|. Write the vector of the gradient :
2 THEORETICAL PART b (25 pts) : Gradient computation for parameters optimization in a neural net for multiclass classification
Let Dn = {(x(1),y(1)),...,(x(n),y(n))} be the dataset with x(i) ∈ Rd and y(i) ∈ {1,...,m} indicating the class within m classes. For vectors and matrices in the following equations, vectors are by default considered to be column vectors.
Consider a neural net of the type Multilayer perceptron (MLP) with only one hidden layer (meaning 3 layers total if we count the input and output layers). The hidden layer is made of dh neurons fully connected to the input layer. We shall consider a non linearity of type rectifier (Rectified Linear Unit or RELU) for the hidden layer. The output layer is made of m neurons that are fully connected to the hidden layer. They are equipped with a softmax non linearity. The output of the jth neuron of the output layer gives a score for the class j which is interpreted as the probability of x being of class j.
It is highly recommended that you draw the neural net as it helps understanding all the steps.
1. Let W(1) a dh × d matrix of weights and b(1) the bias vector be the connections between the input layer and the hidden layer. What is the dimension of b(1) ? Give the formula of the preactivation vector (before the non linearity) of the neurons of the hidden layer ha given x as input, first in a matrix form (ha = ...), and then details on how to compute one element haj = .... Write the output vector of the hidden layer hs with respect to ha.
2. Let W(2) a weight matrix and b(2)a bias vector be the connections between the hidden layer and the output layer. What are the dimensions of W(2) and b(2) ? Give the formula of the activation function of the neurons of the output layer oa with respect to their input hs in a matrix form and then write in a detailed form for oak.
3. The output of the neurons at the output layer is given by :
os = softmax(oa)
Give the precise equation for osk using the softmax (formula with the exp). Show that the osk are positive and sum to 1. Why is this important?
4. The neural net computes, for an input vector x, a vector of probability scores os(x). The probability, computed by a neural net, that an observation x belong to class y is given by the yth output ). This suggests a loss function such as :
Find the equation of L as a function of the vector oa. It is easily achievable with the correct substitution using the equation of the previous question.
5. The training of the neural net will consist of finding parameters that minimize the empirical risk Rˆ associated with this loss function. What is Rˆ ? What is precisely the set θ of parameters of the network? How many scalar parameters nθ are there? Write down the optimization problem of training the network in order to find the optimal values for these parameters.
6. To find a solution to this optimization problem, we will use gradient descent. What is the (batch) gradient descent equation for this problem?
7. We can compute the vector of the gradient of the empirical risk Rˆ with respect to the parameters set θ this way
This hints that we only need to know how to compute the gradient of the loss L with an example(x,y) with respect to the parameters, defined as followed :
We shall use gradient backpropagation, starting with loss L and going to the output layer o then down the hidden layer h then finaly at the input layer x.
Show that
onehotm(y)
Note : Start from the expression of L as a function of oa that you previously found. Start by computing (using the start of the expression of the logarithm derivate). Do the same thing for .
8. What is the numpy equivalent expression (it can fit in 2 operations)?
grad oa = ...
...
IMPORTANT : From now on when we ask to ”compute” the gradients or partial derivates, you only need to write them as function of previously computed derivates (do not substitute the whole expressions already computed in the previous questions!)
9. Compute the gradients with respect to parameters W(2) and b(2) of the
(2) (2) a the result output layer. Since L depends on Wkj and bk only through ok of the chain rule is : and
10. Write down the gradient of the last question in matrix form and define the dimensions of all matrix or vectors involved.
(What are the dimensions?)
Take time to understand why the above equalities are the same as the equations of the last question.
Give the numpy form : grad b2 = ...
grad W2 = ...
11. What is the partial derivate of the loss L with respect to the output of the neurons at the hidden layer? Since L depends on hsj only through the activations of the output neurons oa the chain rule yields :
12. Write down the gradient of the last question in matrix form and define the dimensions of all matrix or vectors involved.
(What are the dimensions?)
Take time to understand why the above equalities are the same as the equations of the last question.
Give the numpy form : grad hs = ...
13. What is the partial derivate of the loss L with respect to the activation of the neurons at the hidden layer? Since L depends on the activation haj only through hsj of this neuron, the chain rule gives :
Note h = rect( ) : the rectifier function is applied elementwise. Start by writing the derivate of the rectifier function = rect0(z) = ....
14. Write down the gradient of the last question in matrix form and define the dimensions of all matrix or vectors involved. Give the numpy form.
15. What is the gradient with respect to the parameters W(1) and b(1) of the hidden layer?
Note : same logic as a previous question
16. Write down the gradient of the last question in matrix form and define the dimensions of all matrix or vectors involved. Give the numpy form. Note : same logic as a previous question
17. What are the partial derivates of the loss L with respect to x? Note : same logic as a previous question
18. We will now consider a regularized emprical risk : R˜ = Rˆ + L(θ), where θ is the vector of all the parameters in the network and L(θ) describes a scalar penalty as a function of the parameters θ. The penalty is given importance according to a prior preferences for the values of θ. The L2 (quadratic) regularization that penalizes the square norm (norm L2) of the weights (but not the biases) is more standard, is used in ridge regression and is sometimes called ”weight-decay”. Here we shall consider a double regularization L2 and L1 which is sometimes named “elastic net” and we will use different hyperparameters (positive scalars λ11,λ12,λ21,λ22) to control the effect of the regularization at each layer
L(θ)
=
L(W(1),b(1),W(2),b(2))
=
=
i,j ij i,j
+λ22 X(Wij(2))2
ij
We will in fact minimize the regularized risk R˜ instead of Rˆ. How does this change the gradient with respect to the different parameters?
3 PRACTICAL PART (50 pts) : Neural network implementation and experiments
We ask you to implement a neural network where you compute the gradients using the formulas derived in the previous part (including elastic net type regularization). You must not use an existing neural network library, but you must use the derivation of part 2 (with corresponding variable names, etc). Note that you can reuse the general learning algorithm structure that we used in the demos, as well as the functions used to plot the decision functions.
Useful details on implementation :
— Numerically stable softmax. You will need to compute a numerically stable softmax. Refer to lecture notes for a proper way of computing a numerically stable softmax. Start by writing the expression for a single vector, then adapt it for a mini-batch of examples stored in a matrix.
— Parameter initialization. As you know, it is necessary to randomly initialize the parameters of your neural network (trying to avoid symmetry and saturating neurons, and ideally so that the pre-activation lies in the bending region of the activation function so that the overall networks acts as a non linear function). We suggest that you sample the weights of a layer from a uniform distribution in , where nc is the number of inputs for this layer (changing from one layer to the other). Biases can be initialized at 0. Justify any other initialization method.
— fprop and bprop. We suggest you implement methods fprop and bprop. fprop will compute the forward progpagation i.e. step by step computation from the input to the output and the cost, of the activations of each layer. bprop will use the computed activations by fprop and does the backpropagation of the gradients from the cost to the input following precisely the steps derived in part 2.
— Finite difference gradient check. We can estimate the gradient numerically using the finite difference method. You will implement this estimate as a tool to check your gradient computation. To do so, calculate the value of the loss function for the current parameter values (for a single example or a mini batch). Then for each scalar parameter θk, change the parameter value by adding a small perturbation and calculate the new value of the loss (same example or minibatch), then set the value of the parameter back to its original value. The partial derivative with respect to this parameter is estimated by dividing the change in the loss function by ε. The ratio of your gradient computed by backpropagation and your estimate using finite difference should be between 0.99 and 1.01.
— Size of the mini batches. We ask that your computation and gradient descent is done in minibatches (as opposed to the whole training set) with adjustable size using a hyperparameter K. In the minibatch case, we do not manipulate a single input vector, but rather a batch of input vectors grouped in a matrix (that will give a matrix representation at each layer, and for the input). In the case where the size is one, we obtain an equivalent to the stochastic gradient. Given that numpy is efficient on matrix operations, it is more efficient to perform computations on a whole minibatch. It will greatly impact the execution time.
Experiments : We will use the two circles dataset and the task of classifying pieces of clothes from fashion MNIST (see links on course website).
1. As a beginning, start with an implementation that computes the gradients for a single example, and check that the gradient is correct using the finite difference method described above.
2. Display the gradients for both methods (direct computation and finite difference) for a small network (e.g. d = 2 and dh = 2) with random weights and for a single example.
3. Add a hyperparameter for the minibatch size K to allow compute the gradients on a minibatch of K examples (in a matrix), by looping over the K examples (this is a small addition to your previous code).
4. Display the gradients for both methods (direct computation and finite difference) for a small network (e.g. d = 2 and dh = 2) with random weights and for a minibatch with 10 examples (you can use examples from both classes from the two circles dataset).
5. Train your neural network using gradient descent on the two circles dataset. Plot the decision regions for several different values of the hyperparameters (weight decay, number of hidden units, early stopping) so as to illustrate their effect on the capacity of the model.
6. As a second step, copy your existing implementation to modify it to a new implementation that will use matrix calculus (instead of a loop) on batches of size K to improve efficiency. Take the matrix expressions in numpy derived in the first part, and adapt them for a minibatch of size K. Show in your report what you have modified (describe the former and new expressions with the shapes of each matrices).
7. Compare both implementations (with a loop and with matrix calculus) to check that they both give the same values for the gradients on the parameters, first for K = 1, then for K = 10. Display the gradients for both methods.
8. Time how long takes an epoch on fashion MNIST (1 epoch = 1 full traversal through the whole training set) for K = 100 for both versions (loop over a minibatch and matrix calculus).
9. Adapt your code to compute the error (proportion of misclassified examples) on the training set as well as the total loss on the training set during each epoch of the training procedure, and at the end of each epoch, it computes the error and average loss on the validation set and the test set. Display the 6 corresponding figures (error and average loss on train/valid/test), and write them in a log file.
10. Train your network on the fashion MNIST dataset. Plot the training/valid/test curves (error and loss as a function of the epoch number, corresponding
to what you wrote in a file in the last question). Add to your report the curves obtained using your best hyperparameters, i.e. for which you obtained your best error on the validation set. We suggest 2 plots : the first one will plot the error rate (train/valid/test with different colors, show which color in a legend) and the other one for the averaged loss (on train/valid/test). You should be able to get less than 20% test error.