1 Neural networks (error-backpropagation, initialization, and nonlinearity)
[Recommended maximum time spent: 1 hour]
In the lecture (see lec8.pdf), we have talked about error-backpropagation, a way to compute partial derivatives (or gradients) w.r.t. the parameters of a neural network. We have also mentioned that optimization is challenging and that nonlinearity is important for neural networks. In this question, you are going to (Q1.1) practice error-backpropagation, (Q1.2) investigate how initialization affects optimization, and (Q1.3) examine the importance of nonlinearity.
Specifically, you are given the following 1-hidden-layer multi-layer perceptron (MLP) for a K-class classification problem (see Fig. 1 for illustration and details), where (x ∈ R^D, y ∈ {1,2,··· ,K}) is a labeled instance:

x ∈ R^D,   (1)
u = W^(1)x + b^(1),   W^(1) ∈ R^{M×D} and b^(1) ∈ R^M,   (2)
h = max{0, u} (element-wise, i.e., relu),   (3)
a = W^(2)h + b^(2),   W^(2) ∈ R^{K×M} and b^(2) ∈ R^K,   (4)
z = softmax(a), i.e., z_k = exp(a_k) / Σ_j exp(a_j),   (5)
ŷ = argmax_k z_k.   (6)
For a K-class classification problem, one popular loss function for training is the cross-entropy loss,

l = −Σ_k 1[y == k] log z_k,   (7)
where 1[True] = 1; otherwise, 0.   (8)

For ease of notation, let us define the one-hot (i.e., 1-of-K) encoding

y ∈ R^K with y_k = 1[y == k],   (9)

so that

l = −Σ_k y_k log z_k = −y^T log z.   (10)
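To make the notation concrete, here is a minimal numpy sketch of the forward pass and loss in Eqns. (1)-(7). It is purely illustrative: the function and variable names are ours, not part of any provided code, and y is treated as a 0-indexed class id.

import numpy as np

def forward_and_loss(x, y, W1, b1, W2, b2):
    u = W1.dot(x) + b1                 # Eqn. (2)
    h = np.maximum(0.0, u)             # Eqn. (3): element-wise relu
    a = W2.dot(h) + b2                 # Eqn. (4)
    e = np.exp(a - np.max(a))          # shift by max(a) for numerical stability
    z = e / np.sum(e)                  # Eqn. (5): softmax
    y_hat = np.argmax(z)               # Eqn. (6)
    loss = -np.log(z[y])               # Eqn. (7): cross-entropy with a 0-indexed label y
    return u, h, a, z, y_hat, loss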
Q1.1 Assume that you have computed u, h, a, z, given (x, y). Please first express ∂l/∂u in terms of ∂l/∂a, u, h, and W^(2):

∂l/∂u = ?

Then express ∂l/∂a in terms of z and y:

∂l/∂a = ?

Finally, compute ∂l/∂W^(1) and ∂l/∂b^(1) in terms of ∂l/∂u and x, and compute ∂l/∂W^(2) in terms of ∂l/∂a and h:

∂l/∂W^(1) = ?   ∂l/∂b^(1) = ?   ∂l/∂W^(2) = ?
You only need to write down the final answers for the above 5 question marks. You are encouraged to use matrix/vector forms to simplify your answers. Note that max{0,u} is not differentiable w.r.t. u at u = 0. Please note that

H(u) = 1 if u > 0, and 0 otherwise,   (11)

which stands for the Heaviside step function (applied element-wise when u is a vector). You can use

∂ max{0,u} / ∂u = H(u)   (12)

in your derivation of ∂l/∂u.
You can also use .∗ to represent the element-wise product between two vectors or matrices. For example,

v .∗ c = [v_1c_1, v_2c_2, ··· , v_Ic_I]^T, where v ∈ R^I and c ∈ R^I.   (13)
Also note that the partial derivatives of the loss function w.r.t. the variables (e.g., a scalar, a vector, or a matrix) will have the same shape as the variables.
What to submit: No more than 5 lines of derivation for each of the 5 partial derivatives.
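If you want to sanity-check your closed-form answers numerically (optional, not part of the submission), a central-difference approximation is one way to do it. The sketch below is generic: loss_fn is any function mapping a parameter matrix to the scalar loss, e.g., a wrapper around the hypothetical forward_and_loss helper sketched above.

import numpy as np

def numeric_grad(loss_fn, W, eps=1e-6):
    # Approximate dl/dW entry by entry with central differences.
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (loss_fn(W_plus) - loss_fn(W_minus)) / (2.0 * eps)
    return grad

# e.g., numeric_grad(lambda W: forward_and_loss(x, y, W, b1, W2, b2)[-1], W1)
# should be close (np.allclose) to your closed-form ∂l/∂W^(1).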
Q1.2 Suppose we initialize W^(1), W^(2), and b^(1) with zero matrices/vectors (i.e., matrices and vectors with all elements set to 0). Please first verify that ∂l/∂W^(1), ∂l/∂W^(2), and ∂l/∂b^(1) are all zero matrices/vectors, irrespective of x, y, and the initialization of b^(2).
Now, if we perform stochastic gradient descent for learning the neural network using a training set {(x_n, y_n)}_{n=1}^N, please explain with a concise mathematical statement in one sentence why no learning will happen on W^(1), W^(2), and b^(1) (i.e., they will not change no matter how many iterations are run). Note that this will still be the case even with weight decay and momentum, as long as the initial velocity vectors/matrices are set to zero.
What to submit: No submission for the verification question. Your concise mathematical statement in one sentence for the explanation question.
Q1.3 As mentioned in the lecture (see lec8.pdf), nonlinearity is very important for neural networks. With nonlinearity (e.g., eq. (3)), the neural network shown in Fig. 1 can be seen as a nonlinear basis function φ (i.e., φ(x) = h) followed by a linear classifier f (i.e., f(h) = ŷ).
Please show that, by removing the nonlinear operation in eq. (3) and setting eq. (4) to be a = W^(2)u + b^(2), the resulting network is essentially a linear classifier. More specifically, you can now represent a as Ux + v, where U ∈ R^{K×D} and v ∈ R^K. Please write down the representation of U and v using W^(1), W^(2), b^(1), and b^(2):

U = ?   v = ?

What to submit: No more than 2 lines of derivation for each of the question marks.
2 Kernel methods
[Recommended maximum time spent: 1 hour]
In the lecture (see lec10.pdf), we have seen the “kernelization” of the regularized least squares problem. The “kernelization” process depends on an important observation: the optimal model parameter can be expressed as a linear combination of the transformed features. You will now prove a more general case.
Consider a convex loss function of the form ℓ(w^Tφ(x), y), where φ(x) ∈ R^M is a nonlinear feature mapping and y is a label or a continuous response value.
Now solve the regularized loss minimization problem on a training set D = {(x_1,y_1), ..., (x_N,y_N)}:

w* = argmin_w  Σ_{n=1}^N ℓ(w^Tφ(x_n), y_n) + (λ/2) ‖w‖_2^2,   (14)

where λ > 0 is the regularization coefficient.
Q2.1 Show that the optimal solution of w can be represented as a linear combination of the transformed training samples φ(x_1), ..., φ(x_N). You can assume ℓ(s,y) is differentiable w.r.t. s; i.e., during the derivation you can use ∂ℓ(s,y)/∂s and assume it is a known quantity.
What to submit: Your derivation (fewer than 10 lines) and the optimal solution of w.
Q2.2 Assume the combination coefficients are α_n for n = 1, ..., N. Rewrite the loss function in Eqn. (14) in terms of the α_n and the kernel function values K_ij = k(x_i, x_j).
What to submit: Your objective function in terms of α and K.
Q2.3 After you obtain the general formulations from Q2.1 and Q2.2, please plug in the three different loss functions we have seen so far and examine what you get.

square loss:
ℓ(w^Tφ(x), y) = (w^Tφ(x) − y)^2,   y ∈ R   (15)

cross-entropy loss:
ℓ(w^Tφ(x), y) = −y log σ(w^Tφ(x)) − (1 − y) log(1 − σ(w^Tφ(x))),   y ∈ {0, 1},   (16)
where σ(·) is the sigmoid function

perceptron loss:
ℓ(w^Tφ(x), y) = max(−y w^Tφ(x), 0),   y ∈ {−1, 1}   (17)
What to submit: Nothing.
Programming component
3 High-level descriptions
3.1 Dataset
We will use mnist_subset (images of handwritten digits from 0 to 9). This is the same subset of the full MNIST that we used for Homework 1. As before, the dataset is stored in a JSON-formatted file mnist_subset.json. You can access its training, validation, and test splits using the keys 'train', 'valid', and 'test', respectively. For example, suppose we load mnist_subset.json into the variable x. Then, x['train'] refers to the training set of mnist_subset. This set is a list with two elements: x['train'][0] contains the features, of size N (samples) × D (dimension of features), and x['train'][1] contains the corresponding labels, of size N.
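The provided scripts already load the data for you; the following is only a sketch of how one could load and index mnist_subset.json with the standard json module, assuming the file sits in the working directory.

import json
import numpy as np

with open('mnist_subset.json', 'r') as f:
    x = json.load(f)

Xtrain = np.asarray(x['train'][0])   # shape (N, D): features
ytrain = np.asarray(x['train'][1])   # shape (N,): labels of the N samples
Xvalid, yvalid = np.asarray(x['valid'][0]), np.asarray(x['valid'][1])
Xtest, ytest = np.asarray(x['test'][0]), np.asarray(x['test'][1])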
3.2 Tasks
You will be asked to implement 10-way classification using multinomial logistic regression (Sect. 4) and neural networks (Sect. 5). Specifically, you will
• finish the implementation of all python functions in our template code.
• run your code by calling the specified scripts to generate output files.
• add, commit, and push (1) all *.py files, and (2) all *.json files that you have created.
In the next two subsections, we will provide a high-level checklist of what you need to do. Furthermore, as in the previous homework, you are not responsible for loading/pre-processing data; we have done that for you. For specific instructions, please refer to text in Sect. 4 and Sect. 5, as well as corresponding python scripts.
3.2.1 Multi-class Classification
Coding In logistic_prog.py, finish implementing the following functions: logistic_train_ovr, logistic_test_ovr, logistic_mul_train, and logistic_mul_test. Refer to logistic_prog.py and Sect. 4 for more information.
Running your code Run the script q43.sh after you finish your implementation. This will output logistic_res.json.
What to submit Submit both logistic_prog.py and logistic_res.json.
3.2.2 Neural networks
Preparation Read dnn_mlp.py and dnn_cnn.py.
Coding
First, in dnn_misc.py, finish implementing
• forward and backward functions in class linear_layer
• forward and backward functions in class relu
• backward function in class dropout (before that, please read the forward function).
Refer to dnn_misc.py and Sect. 5 for more information.
Second, in dnn_cnn_2.py, finish implementing the main function. There are three TODO items. Refer to dnn_cnn_2.py and Sect. 5 for more information.
Running your code Run the scripts q53.sh, q54.sh, q55.sh, q56.sh, q57.sh, q58.sh, and q510.sh after you finish your implementation. This will generate, respectively,
MLP_lr0.01_m0.0_w0.0_d0.0.json
MLP_lr0.01_m0.0_w0.0_d0.5.json
MLP_lr0.01_m0.0_w0.0_d0.95.json
LR_lr0.01_m0.0_w0.0_d0.0.json
CNN_lr0.01_m0.0_w0.0_d0.5.json
CNN_lr0.01_m0.9_w0.0_d0.5.json
CNN2_lr0.001_m0.9_w0.0_d0.5.json
What to submit Submit dnn_misc.py, dnn_cnn_2.py, and the seven .json files generated by the scripts.
3.3 Cautions
Please do not import packages that are not listed in the provided code. Follow the instructions in each section strictly to code up your solutions. Do not change the output format. Do not modify the code unless we instruct you to do so. A homework solution that does not match the provided setup, such as format, name, initializations, etc., will not be graded. It is your responsibility to make sure that your code runs with the provided commands and scripts on the VM. Finally, make sure that you git add, commit, and push all the required files, including your code and generated output files.
3.4 Advice
We use the softmax and sigmoid functions extensively in this homework. To avoid numerical issues such as overflow and underflow caused by numpy.exp() and numpy.log(), please use the following implementations:
• Let x be an input vector to the softmax function f. Use x̃ = x − max(x) instead of x as the input to f; the outputs are unchanged because f(x̃)_i = exp(x̃_i) / Σ_j exp(x̃_j) = exp(x_i) / Σ_j exp(x_j) = f(x)_i.
• If you are using numpy.log(), make sure its input is positive. Also, one of the outputs of softmax, e.g., f(x̃)_i, may be extremely small while you need the value ln(f(x̃)_i); in that case you can convert the computation into ln f(x̃)_i = x̃_i − ln Σ_j exp(x̃_j). A minimal numpy sketch of both tricks is given after this list.
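Here is the minimal numpy sketch referred to above (illustrative only; follow the exact conventions in the provided template code when implementing):

import numpy as np

def stable_softmax(x):
    # Shifting by max(x) prevents overflow in np.exp and leaves the output unchanged.
    x_shift = x - np.max(x)
    e = np.exp(x_shift)
    return e / np.sum(e)

def stable_log_softmax(x):
    # ln f(x~)_i = x~_i - ln(sum_j exp(x~_j)), computed without forming a tiny probability.
    x_shift = x - np.max(x)
    return x_shift - np.log(np.sum(np.exp(x_shift)))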
We have implemented and run the code ourselves without any problems, so if you follow the instructions and settings provided in the python files, you should not encounter overflow or underflow.
4 Multi-class Classification
You will modify 4 python functions in logistic_prog.py. First, you will implement two functions that train and test a one-versus-rest multi-class classification model. Second, you will implement two functions that train and test a multinomial logistic regression model. Finally, you will run the command that trains and tests the two models using your implemented functions, and our code will automatically store your results in logistic_res.json.
Coding: One-versus-rest
Q4.1 Implement the code to solve the multi-class classification task with the one-versus-rest strategy. That is, train 10 binary logistic regression models following the setting provided in class: for each class C_k, k = 1,··· ,10, we create a binary classification problem as follows:
• Re-label training samples with label C_k as positive (namely 1)
• Re-label other samples as negative (namely 0)
We wrote functions to load, relabel, and sample the data for you, so you are not responsible for doing it.
Training Finish the implementation of the function logistic_train_ovr(Xtrain, ytrain, w, b, step_size, max_iterations). As in the previous homework, we have pre-defined the hyper-parameters and initializations in the template code. Moreover, you will use the AVERAGE of gradients from all training samples to update the parameters.
Testing Finish the implementation of the function logistic_test_ovr(Xtest, w_l, b_l). This function should return the predicted probability, i.e., the value output by the logistic function without thresholding, instead of the 0/1 label. Formally, for each test data point x_i, we get its final prediction as ŷ_i = argmax_{k∈{1,···,10}} f_k(x_i), where ŷ_i is the predicted label and f_k(x_i) is the probability predicted by the k-th logistic regression model f_k. Then, you compute the classification accuracy as follows:

accuracy = (1 / N_test) Σ_{i=1}^{N_test} 1[ŷ_i == y_i],   (18)

where y_i is the ground-truth label of x_i and N_test is the total number of test data instances.
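As an illustration of Eqn. (18) (with our own variable names and 0-indexed classes), the prediction and accuracy can be computed in two numpy calls:

import numpy as np

def ovr_accuracy(probs, ytest):
    # probs: (N_test, 10) array, probs[i, k] = probability from the (k+1)-th model f_{k+1}(x_i)
    # ytest: (N_test,) array of ground-truth labels
    y_hat = np.argmax(probs, axis=1)   # predicted label for each test point
    return np.mean(y_hat == ytest)     # Eqn. (18): fraction of correct predictions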
What to do and submit: Your logistic_prog.py with completed logistic_train_ovr and logistic_test_ovr.
Coding: Multinomial logistic regression
Q4.2 Implement multinomial logistic regression, training a 10-way classifier (with the softmax function) on the mnist_subset dataset.
Training Finish the implementation of the function logistic_mul_train(Xtrain, ytrain, w, b, step_size, max_iterations). Again, we have pre-defined the hyper-parameters and initializations in the template code. Moreover, you will use the AVERAGE of gradients from all training samples to update the parameters.
Testing Finish the implementation of the function logistic_mul_test(Xtest, w_l, b_l). For each test data point x_i, compute ŷ_i = argmax_{k∈{1,···,10}} p(y = k | x_i), where p(y = k | x_i) is the probability predicted by the multinomial logistic regression model. Then, compute the accuracy following Eqn. (18).
What to do and submit: Your logistic_prog.py with completed logistic_mul_train and logistic_mul_test.
Training and generating the output file for both the one-versus-rest and multinomial logistic regression models
Q4.3 What to do and submit: run script q43.sh. It will generate logistic_res.json. Add, commit, and push both logistic_prog.py and logistic_res.json before the due date. What it does: q43.sh will run python3 logistic_prog.py. This will train your models (for both Q4.1 and Q4.2 above) and test the trained models (for both Q4.1 and Q4.2 above). The output file stores the accuracies of both models.
5 Neural networks: multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs)
In recent years, neural networks have been one of the most powerful machine learning models. Many toolboxes/platforms (e.g., TensorFlow, PyTorch, Torch, Theano, MXNet, Caffe, CNTK) are publicly available for efficiently constructing and training neural networks. The core idea of these toolboxes is to treat a neural network as a combination of data transformation modules. For example, in Fig. 2, the edges correspond to module names of the same neural network shown in Fig. 1 and Sect. 1.
Now we will provide more information on the modules used in this homework. Each module has its own parameters (though note that a module may have no parameters). Moreover, each module can perform a forward pass and a backward pass. The forward pass performs the computation of the module, given the input to the module. The backward pass computes the partial derivatives of the loss function w.r.t. the input and parameters, given the partial derivatives of the loss function w.r.t. the output of the module. Consider a module ⟨module_name⟩. Let ⟨module_name⟩.forward and ⟨module_name⟩.backward be its forward and backward passes, respectively.
Figure 2: A diagram of a 1-hidden-layer multi-layer perceptron (MLP), with modules indicated on the edges. The circles correspond to variables. The rectangles shown in Fig. 1 are removed for clarity. The term relu stands for rectified linear units.
For example, the linear module may be defined as follows.
forward pass: u = linear^(1).forward(x) = W^(1)x + b^(1),   (19)
where W^(1) and b^(1) are its parameters.
backward pass: [∂l/∂x, ∂l/∂W^(1), ∂l/∂b^(1)] = linear^(1).backward(x, ∂l/∂u).   (20)
Let us assume that we have implemented all the desired modules. Then, getting ŷ for x is equivalent to running the forward pass of each module in order, given x. All the intermediate variables (i.e., u, h, etc.) will be computed along the forward pass. Similarly, getting the partial derivatives of the loss function w.r.t. the parameters is equivalent to running the backward pass of each module in reverse order, given ∂l/∂z.
In this question, we provide a Python environment based on the idea of modules. Every module is defined as a class, so you can create multiple modules of the same functionality by creating multiple object instances of the same class. Your work is to finish the implementation of several modules, where these modules are elements of a multi-layer perceptron (MLP) or a convolutional neural network (CNN). We will apply these models to the same 10-class classification problem introduced in Sect. 4. We will train the models using stochastic gradient descent with mini-batches, and explore how different hyper-parameters of optimizers and regularization techniques affect training and validation accuracies over training epochs. For deeper understanding, check out, e.g., the seminal work of Yann LeCun et al., “Gradient-based learning applied to document recognition,” written in 1998.
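To make the module idea concrete before you open the files, here is a toy sketch of two modules and how their forward/backward passes chain. The classes and names below are ours and intentionally different from those in dnn_misc.py; they only illustrate the calling pattern.

import numpy as np

class ToyScale(object):
    # Toy module s = c * q with a single scalar parameter c.
    def __init__(self, c=2.0):
        self.c = c
    def forward(self, q):
        return self.c * q
    def backward(self, q, grad_out):
        self.grad_c = np.sum(grad_out * q)   # dl/dc, kept for the parameter update
        return self.c * grad_out             # dl/dq, passed to the previous module

class ToySquare(object):
    # Toy module s = q ** 2 (element-wise); it has no parameters.
    def forward(self, q):
        return q ** 2
    def backward(self, q, grad_out):
        return 2.0 * q * grad_out

scale, square = ToyScale(), ToySquare()
q = np.array([1.0, -2.0, 3.0])
u = scale.forward(q)                 # forward passes run in order ...
z = square.forward(u)
grad_z = np.ones_like(z)             # pretend dl/dz is given
grad_u = square.backward(u, grad_z)  # ... backward passes run in reverse order
grad_q = scale.backward(q, grad_u)   # scale.grad_c now holds dl/dc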
We give a specific example below. Suppose that, at iteration t, you sample a mini-batch of N examples {(x_i, y_i)}_{i=1}^N from the training set (K = 10). Then, the loss of such a mini-batch given by Fig. 2 is

(1/N) Σ_{i=1}^N l(softmax.forward(linear^(2).forward(relu.forward(linear^(1).forward(x_i)))), y_i)   (21)
= (1/N) Σ_{i=1}^N l(softmax.forward(linear^(2).forward(relu.forward(u_i))), y_i)   (22)
= ···   (23)
= (1/N) Σ_{i=1}^N l(softmax.forward(a_i), y_i)   (24)
= (1/N) Σ_{i=1}^N l(z_i, y_i).   (25)
That is, in the forward pass, we can perform the computation of a certain module on all the N input examples, and then pass the N output examples to the next module. The same holds for the backward pass. For example, according to Fig. 2, if we now pass the partial derivatives of the loss w.r.t. {a_i}_{i=1}^N to linear^(2).backward, then

[{∂l/∂h_i}_{i=1}^N, ∂l/∂W^(2), ∂l/∂b^(2)] = linear^(2).backward({h_i}_{i=1}^N, {∂l/∂a_i}_{i=1}^N).   (26)

linear^(2).backward will then compute {∂l/∂h_i}_{i=1}^N and pass them back to relu.backward.
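For instance, with a mini-batch stored as an N × D matrix X (one example per row), the forward computation of linear^(1) and relu for all N examples can be done with vectorized numpy calls (variable names are ours):

import numpy as np

N, D, M = 8, 784, 100
X = np.random.randn(N, D)                            # a mini-batch of N examples
W1, b1 = 0.01 * np.random.randn(M, D), np.zeros(M)
U = X.dot(W1.T) + b1                                 # shape (N, M): u_i for every example i
H = np.maximum(0.0, U)                               # relu applied to the whole batch at once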
Preparation
Q5.1 Please read through dnn_mlp.py and dnn_cnn.py. Both files use modules defined in dnn_misc.py (which you will modify). Your work is to understand how modules are created, how they are linked to perform the forward and backward passes, and how parameters are updated based on gradients (and momentum). The architectures of the MLP and CNN defined in dnn_mlp.py and dnn_cnn.py are shown in Fig. 3 and Fig. 4, respectively.
What to submit: Nothing.
Coding: Modules
Figure 3: The diagram of the MLP implemented in dnn_mlp.py. The circles correspond to variables and the edges to modules.
Figure 4: The diagram of the CNN implemented in dnn_cnn.py. The circles correspond to variables and edges correspond to modules. Note that the input to the CNN may not be a vector (e.g., in dnn_cnn.py it is an image, which can be represented as a 3-dimensional tensor). The flatten layer reshapes its input into a vector.
Q5.2 You will modify dnn_misc.py. This script defines all the modules that you will need to construct the MLP and CNN in dnn_mlp.py and dnn_cnn.py, respectively. You have three tasks. First, finish the implementation of the forward and backward functions in class linear_layer. Please follow Eqn. (2) for the forward pass. Second, finish the implementation of the forward and backward functions in class relu. Please follow Eqn. (3) for the forward pass and Eqn. (11) for deriving the partial derivatives (note that relu itself has no parameters). Third, finish the implementation of the backward function in class dropout. We define the forward pass and the backward pass as follows.
forward pass: s = dropout.forward(q) = (1 / (1 − r)) · 1[p ≥ r] .∗ q,   (27)
where q ∈ R^J is the input, p_j is sampled uniformly from [0,1), ∀j ∈ {1,··· ,J}, 1[p ≥ r] ∈ R^J has entries 1[p_j ≥ r], and r ∈ [0,1) is a pre-defined scalar named the dropout rate.

backward pass: ∂l/∂q = dropout.backward(q, ∂l/∂s).   (28)

Note that p_j, j ∈ {1,··· ,J}, and r are not learned, so we do not need to compute the derivatives w.r.t. them. Moreover, p_j, j ∈ {1,··· ,J}, are re-sampled in every forward pass, and are kept for the following backward pass. The dropout rate r is set to 0 during testing.
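For intuition only, a sketch of the forward pass in Eqn. (27) might look as follows; the names are ours, and the authoritative version is the forward function already provided in class dropout in dnn_misc.py.

import numpy as np

def dropout_forward(q, r, training=True):
    if not training or r == 0.0:
        return q, np.ones_like(q)                  # r = 0 at test time: identity
    p = np.random.uniform(0.0, 1.0, size=q.shape)  # p_j ~ Uniform[0, 1)
    mask = (p >= r).astype(q.dtype) / (1.0 - r)    # keep with probability 1 - r, rescale
    return mask * q, mask                          # keep the mask for the backward pass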
Detailed descriptions/instructions about each pass (i.e., what to compute and what to return) are included in dnn_misc.py. Please do read carefully.
Note that in this script we import numpy as np. Thus, to call a function XX from numpy, please use np.XX.
What to do and submit: Finish the implementation of the 5 functions specified above in dnn_misc.py. Submit your completed dnn_misc.py.
Testing dnn_misc.py
Q5.3 What to do and submit: run script q53.sh. It will output MLP_lr0.01_m0.0_w0.0_d0.0.json.
Add, commit, and push this file before the due date.
What it does: q53.sh will run python3 dnn_mlp.py with learning rate 0.01, no momentum, no weight decay, and dropout rate 0.0. The output file stores the training and validation accuracies over 30 training epochs.
Q5.4 What to do and submit: run script q54.sh. It will output MLP_lr0.01_m0.0_w0.0_d0.5.json.
Add, commit, and push this file before the due date.
What it does: q54.sh will run python3 dnn_mlp.py --dropout_rate 0.5 with learning rate 0.01, no momentum, no weight decay, and dropout rate 0.5. The output file stores the training and validation accuracies over 30 training epochs.
Q5.5 What to do and submit: run script q55.sh. It will output MLP_lr0.01_m0.0_w0.0_d0.95.json.
Add, commit, and push this file before the due date.
What it does: q55.sh will run python3 dnn_mlp.py --dropout_rate 0.95 with learning rate 0.01, no momentum, no weight decay, and dropout rate 0.95. The output file stores the training and validation accuracies over 30 training epochs.
You will observe that the model in Q5.4 gives better validation accuracy (at epoch 30) than the one in Q5.3; dropout is widely used to prevent over-fitting. However, if the dropout rate is too large (as in Q5.5), both the validation and training accuracies will be relatively low, essentially under-fitting the training data.
Q5.6 What to do and submit: run script q56.sh. It will output LR_lr0.01_m0.0_w0.0_d0.0.json. Add, commit, and push this file before the due date.
What it does: q56.sh will run python3 dnn_mlp_nononlinear.py with learning rate 0.01, no momentum, no weight decay, and dropout rate 0.0. The output file stores the training and validation accuracies over 30 training epochs.
The network has the same structure as the one in Q5.3, except that we remove the relu (nonlinear) layer. You will see that the validation accuracies drop significantly (the gap is around 0.03). Essentially, without the nonlinear layer, the model is learning a multinomial logistic regression similar to that in Q4.2.
Q5.7 What to do and submit: run script q57.sh. It will output CNN_lr0.01_m0.0_w0.0_d0.5.json.
Add, commit, and push this file before the due date.
What it does: q57.sh will run python3 dnn_cnn.py with learning rate 0.01, no momentum, no weight decay, and dropout rate 0.5. The output file stores the training and validation accuracies over 30 training epochs.
Figure 5: The diagram of the CNN you are going to implement in dnn_cnn_2.py. The term conv stands for convolution; max-p stands for max pooling. The circles correspond to variables and edges correspond to modules. Note that the input to the CNN may not be a vector (e.g., in dnn_cnn_2.py it is an image, which can be represented as a 3-dimensional tensor). The flatten layer reshapes its input into a vector.
Q5.8 What to do and submit: run script q58.sh. It will output CNN_lr0.01_m0.9_w0.0_d0.5.json.
Add, commit, and push this file before the due date.
What it does: q58.sh will run python3 dnn_cnn.py --alpha 0.9 with learning rate 0.01, momentum 0.9, no weight decay, and dropout rate 0.5. The output file stores the training and validation accuracies over 30 training epochs.
You will see that Q5.8 will lead to faster convergence than Q5.7 (i.e., the training/validation accuracies will be higher than 0.94 after 1 epoch). That is, using momentum will lead to more stable updates of the parameters.
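For reference, a classical momentum update has roughly the following form (a sketch with hypothetical names; the exact convention used in this homework is fixed by the provided dnn_mlp.py/dnn_cnn.py code):

def momentum_step(w, velocity, grad, lr=0.01, alpha=0.9):
    # alpha = 0 recovers plain stochastic gradient descent.
    velocity = alpha * velocity - lr * grad
    return w + velocity, velocity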
Coding: Building a deeper architecture
Q5.9 The CNN architecture in dnn_cnn.py has only one convolutional layer. In this question, you are going to construct a two-convolutional-layer CNN (see Fig. 5) using the modules you implemented in Q5.2. Please modify the main function in dnn_cnn_2.py. The code in dnn_cnn_2.py is similar to that in dnn_cnn.py, except that a few parts are marked as TODO. You need to fill in your code so as to construct the CNN in Fig. 5.
What to do and submit: Finish the implementation of the main function in dnn_cnn_2.py (search for TODO in main). Submit your completed dnn_cnn_2.py.
Testing dnn_cnn_2.py
Q5.10 What to do and submit: run script q510.sh. It will output CNN2_lr0.001_m0.9_w0.0_d0.5.json.
Add, commit, and push this file before the due date.
What it does: q510.sh will run python3 dnn_cnn_2.py --alpha 0.9 with learning rate 0.001, momentum 0.9, no weight decay, and dropout rate 0.5. The output file stores the training and validation accuracies over 30 training epochs.
You will see that you can achieve slightly higher validation accuracies than those in Q5.8.