Neural Networks
For this assignment, you are asked to implement neural networks. You will use your neural network to classify the MNIST database of handwritten digits (0-9). The architecture you will implement is a multi-layer perceptron (MLP, just another term for the fully connected feedforward networks discussed in lecture), shown below. It is designed for a K-class classification problem.
Let $(x \in \mathbb{R}^D, y \in \{1, 2, \cdots, K\})$ be a labeled instance. Such an MLP performs the following computations:
$$
\begin{aligned}
\text{input features:} \quad & x \in \mathbb{R}^D \\
\text{linear}^{(1)}: \quad & u = W^{(1)} x + b^{(1)}, \quad W^{(1)} \in \mathbb{R}^{M \times D} \text{ and } b^{(1)} \in \mathbb{R}^{M} \\
\text{tanh:} \quad & h = \frac{2}{1 + e^{-2u}} - 1 \\
\text{relu:} \quad & h = \max\{0, u\} = \begin{bmatrix} \max\{0, u_1\} \\ \vdots \\ \max\{0, u_M\} \end{bmatrix} \\
\text{linear}^{(2)}: \quad & a = W^{(2)} h + b^{(2)}, \quad W^{(2)} \in \mathbb{R}^{K \times M} \text{ and } b^{(2)} \in \mathbb{R}^{K} \\
\text{softmax:} \quad & z = \begin{bmatrix} \frac{e^{a_1}}{\sum_k e^{a_k}} \\ \vdots \\ \frac{e^{a_K}}{\sum_k e^{a_k}} \end{bmatrix} \\
\text{predicted label:} \quad & \hat{y} = \arg\max_k z_k.
\end{aligned}
$$
For a $K$-class classification problem, one popular loss function for training (i.e., to learn $W^{(1)}$, $W^{(2)}$, $b^{(1)}$, $b^{(2)}$) is the cross-entropy loss. Specifically, we denote the cross-entropy loss with respect to the training example $(x, y)$ by $l$:
$$
l = -\log(z_y) = \log\Big(1 + \sum_{k \neq y} e^{a_k - a_y}\Big)
$$
Note that one should look at $l$ as a function of the parameters of the network, that is, $W^{(1)}, b^{(1)}, W^{(2)}$ and $b^{(2)}$. For ease of notation, let us define the one-hot (i.e., 1-of-$K$) encoding of a class $y$ as
$$
y \in \mathbb{R}^K \quad \text{and} \quad y_k =
\begin{cases}
1, & \text{if } y = k, \\
0, & \text{otherwise,}
\end{cases}
$$
so that
$$
l = -\sum_k y_k \log z_k = -y^{T}
\begin{bmatrix}
\log z_1 \\ \vdots \\ \log z_K
\end{bmatrix}
= -y^{T} \log z.
$$
We can then perform error-backpropagation, a way to compute partial derivatives (or gradients) w.r.t. the parameters of a neural network, and use gradient-based optimization to learn the parameters.
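As a reference for the backward pass (a standard derivation, not something provided in the starter code), combining the softmax with the cross-entropy loss gives a particularly simple gradient with respect to the pre-softmax scores $a$:

$$
\frac{\partial l}{\partial a} = z - y,
$$

and the remaining gradients follow from the chain rule, e.g., $\frac{\partial l}{\partial W^{(2)}} = \frac{\partial l}{\partial a} h^{T}$ and $\frac{\partial l}{\partial b^{(2)}} = \frac{\partial l}{\partial a}$.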
Submission: You need to submit both neural_networks.py and utils.py.
Q1. Mini-batch Stochastic Gradient Descent
First, you need to implement mini-batch stochastic gradient descent (SGD), a gradient-based optimization method used to learn the parameters of the neural network.
You need to implement two alternatives for SGD, one without momentum and one with momentum. We will pass a variable $\alpha$ to indicate which option to use. When $\alpha \leq 0$, the parameters are updated by the gradient alone. When $\alpha > 0$, the parameters are updated with momentum, and $\alpha$ also serves as the discount factor, as follows:
$$
\begin{aligned}
\upsilon &= \alpha \upsilon - \eta \delta_t \\
w_t &= w_{t-1} + \upsilon
\end{aligned}
$$
You can use the formula above to update the weights.
Here, $\alpha$ is the discount factor such that $\alpha \in (0,1)$. It is given by us and you do not need to adjust it.
$\eta$ is the learning rate. It is also given by us.
$\upsilon$ is the velocity (a.k.a. momentum) update, and $\delta_t$ is the gradient.
TODO 1 You need to complete def miniBatchStochasticGradientDescent(model, momentum, _lambda, _alpha, _learning_rate) in neural_networks.py
Notice that for a complete mini-batch SGD pipeline, you would also need to choose the mini-batch size and the number of epochs. In this assignment, we omit this step: both the mini-batch size and the number of epochs have already been given, and you do not need to adjust them.
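The update rule above translates almost directly into code. Below is a minimal sketch, assuming the starter code stores each module's learnable parameters in a params dict and the corresponding gradients in a gradient dict, keeps velocities in the momentum dict keyed by module and parameter name, and uses _lambda as an L2 weight-decay coefficient. These storage details are assumptions about neural_networks.py, not a definitive implementation.

```python
def miniBatchStochasticGradientDescent(model, momentum, _lambda, _alpha, _learning_rate):
    # model: dict of modules; momentum: dict of velocities pre-initialized to zeros
    # (assumed layout of the starter code)
    for module_name, module in model.items():
        # skip modules without learnable parameters (e.g., relu, tanh, dropout)
        if not hasattr(module, 'params'):
            continue
        for key, w in module.params.items():
            # gradient of the loss plus an (assumed) L2 regularization term
            g = module.gradient[key] + _lambda * w
            if _alpha <= 0:
                # plain SGD: w <- w - eta * g
                module.params[key] = w - _learning_rate * g
            else:
                # SGD with momentum: v <- alpha * v - eta * g ; w <- w + v
                momentum_key = module_name + '_' + key
                momentum[momentum_key] = _alpha * momentum[momentum_key] - _learning_rate * g
                module.params[key] = w + momentum[momentum_key]
    return model
```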
Q2. Linear Layer (15 points)
Second, you need to implement the linear layer of the MLP. In this part, you need to implement three Python functions in class linear_layer.
In the function def __init__(self, input_D, output_D), you need to initialize W with random values using np.random.normal such that the mean is 0 and the standard deviation is 0.1. You also need to initialize the gradients to zeros in the same function.
$$
\begin{aligned}
\text{forward pass:} \quad & u = \text{linear}^{(1)}.\text{forward}(x) = W^{(1)} x + b^{(1)}, \quad \text{where } W^{(1)} \text{ and } b^{(1)} \text{ are its parameters.} \\
\text{backward pass:} \quad & \Big[\tfrac{\partial l}{\partial x}, \tfrac{\partial l}{\partial W^{(1)}}, \tfrac{\partial l}{\partial b^{(1)}}\Big] = \text{linear}^{(1)}.\text{backward}\Big(x, \tfrac{\partial l}{\partial u}\Big).
\end{aligned}
$$
You can use the above formulas as a reference to implement the forward pass def forward(self, X) and the backward pass def backward(self, X, grad) in class linear_layer. In the backward pass, you only need to return backward_output (the gradient with respect to the input), but you must also compute and store the gradients of W and b.
TODO 2 You need to complete def __init__(self, input_D, output_D) in class linear_layer of neural_networks.py
TODO 3 You need to complete def forward(self, X) in class linear_layer of neural_networks.py
TODO 4 You need to complete def backward(self, X, grad) in class linear_layer of neural_networks.py
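For orientation, here is a minimal sketch of the linear layer. It assumes X holds one example per row with shape (N, input_D), grad has shape (N, output_D), and parameters/gradients live in params and gradient dicts; these conventions, and the choice to also draw b from the same normal distribution (the assignment only specifies W explicitly), are assumptions rather than the required implementation.

```python
import numpy as np

class linear_layer:
    def __init__(self, input_D, output_D):
        self.params = dict()
        # W ~ N(0, 0.1^2); b initialized the same way here (assumption: zeros may also be acceptable)
        self.params['W'] = np.random.normal(0, 0.1, (input_D, output_D))
        self.params['b'] = np.random.normal(0, 0.1, (1, output_D))
        # gradients start at zero
        self.gradient = dict()
        self.gradient['W'] = np.zeros((input_D, output_D))
        self.gradient['b'] = np.zeros((1, output_D))

    def forward(self, X):
        # u = X W + b, with b broadcast across the N rows
        return np.dot(X, self.params['W']) + self.params['b']

    def backward(self, X, grad):
        # store parameter gradients for the SGD step
        self.gradient['W'] = np.dot(X.T, grad)
        self.gradient['b'] = np.sum(grad, axis=0, keepdims=True)
        # return the gradient w.r.t. the input, passed on to the previous layer
        return np.dot(grad, self.params['W'].T)
```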
Q3. Activation function - tanh (15 points)
Now, you need to implement the activation function tanh. In this part, you need to implement two Python functions in class tanh. In def forward(self, X), implement the forward pass; then derive its derivative and implement def backward(self, X, grad), i.e., the backward pass, accordingly.
$$
\text{tanh:} \quad h = \frac{2}{1 + e^{-2u}} - 1
$$
You can use the above formula for tanh as a reference.
TODO 5 You need to complete def forward(self, X) in class tanh of neural_networks.py
TODO 6 You need to complete def backward(self, X, grad) in class tanh of neural_networks.py
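A minimal sketch of the two methods, assuming the class only needs X and the incoming gradient (the starter code may cache intermediate results differently). It uses the identity $\frac{d}{du}\tanh(u) = 1 - \tanh(u)^2$.

```python
import numpy as np

class tanh:
    def forward(self, X):
        # h = 2 / (1 + exp(-2u)) - 1, which equals np.tanh(X)
        return 2.0 / (1.0 + np.exp(-2.0 * X)) - 1.0

    def backward(self, X, grad):
        # d tanh(u)/du = 1 - tanh(u)^2, applied element-wise and chained with grad
        h = 2.0 / (1.0 + np.exp(-2.0 * X)) - 1.0
        return grad * (1.0 - h ** 2)
```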
Q4. Activation function - relu (15 points)
You need to implement another activation function called relu. In this part, you need to implement two Python functions in class relu. In def forward(self, X), implement the forward pass; then derive its derivative and implement def backward(self, X, grad), i.e., the backward pass, accordingly.
$$
\text{relu:} \quad h = \max\{0, u\} = \begin{bmatrix} \max\{0, u_1\} \\ \vdots \\ \max\{0, u_M\} \end{bmatrix}
$$
You can use the above formula for relu as a reference.
TODO 7 You need to complete def forward(self, X) in class relu of neural_networks.py
TODO 8 You need to complete def backward(self, X, grad) in class relu of neural_networks.py
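A minimal sketch under the same assumptions as the tanh module. The derivative is 1 where the input is positive and 0 elsewhere; the value at exactly 0 is taken as 0 here, which is the usual convention.

```python
import numpy as np

class relu:
    def forward(self, X):
        # element-wise max{0, u}
        return np.maximum(0, X)

    def backward(self, X, grad):
        # pass the gradient through only where the input was positive
        return grad * (X > 0).astype(float)
```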
Q5. Dropout (15 points)
To prevent overfitting, we usually add regularization, and dropout is another way of handling overfitting. In this part, you will first read and understand def forward(self, X, is_train), i.e., the forward pass of class dropout. You will then derive the partial derivatives accordingly and implement def backward(self, X, grad), i.e., the backward pass of class dropout.
Now we take an intermediate variable $q \in \mathbb{R}^J$, which is the output from one of the layers. Then we define the forward and the backward passes of dropout as follows.
The forward pass obtains the output after dropout.
$$
\text{forward pass:} \quad s = \text{dropout.forward}(q \in \mathbb{R}^J) = \frac{1}{1-r} \times
\begin{bmatrix}
\mathbb{1}[p_1 \geq r] \times q_1 \\ \vdots \\ \mathbb{1}[p_J \geq r] \times q_J
\end{bmatrix},
$$
where $p_j$ is generated randomly from $[0, 1)$, $\forall j \in \{1, \cdots, J\}$, and $r \in [0, 1)$ is a pre-defined scalar named the dropout rate, which is given to you.
The backward pass computes the partial derivative of the loss with respect to $q$ from the one with respect to the forward-pass result, $\partial l / \partial s$.
$$
\text{backward pass:} \quad \frac{\partial l}{\partial q} = \text{dropout.backward}\Big(q, \frac{\partial l}{\partial s}\Big) = \frac{1}{1-r} \times
\begin{bmatrix}
\mathbb{1}[p_1 \geq r] \times \frac{\partial l}{\partial s_1} \\ \vdots \\ \mathbb{1}[p_J \geq r] \times \frac{\partial l}{\partial s_J}
\end{bmatrix}.
$$
Note that $p_j, j \in \{1, \cdots, J\}$ and $r$ are not learned, so we do not need to compute derivatives w.r.t. them. You do not need to find the best $r$ since we have picked it for you. Moreover, $p_j, j \in \{1, \cdots, J\}$ are re-sampled at every forward pass and are kept for the corresponding backward pass.
TODO 9 You need to complete def backward(self, X, grad) in class dropout of neural_networks.py
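A minimal sketch of the backward pass, assuming the provided forward pass stores its scaled keep/drop mask in self.mask (the actual attribute name in the starter code may differ).

```python
    def backward(self, X, grad):
        # the forward pass multiplies q element-wise by a mask that already includes
        # the 1/(1-r) scaling, so the backward pass applies the same mask to dl/ds
        return grad * self.mask
```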
Q6. Connecting the dots
In this part, you will combine the modules written in questions Q1 to Q5 by completing the TODO snippets in def main(main_params, optimization_type="minibatch_sgd"), i.e., the main function. Having implemented the forward and backward passes of the MLP layers in Q1 to Q5, you will now call the forward and backward methods of every layer in the model, in the appropriate order dictated by the architecture.
TODO 10 You need to complete main(main_params, optimization_type="minibatch_sgd") in neural_networks.py
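To illustrate the ordering only, here is a sketch of one training step. The dict keys ('L1', 'nonlinear1', 'drop1', 'L2', 'loss') and the helper function one_training_step are hypothetical names, not the starter code's; the point is that the backward calls visit the same modules as the forward calls, in exactly the reverse order.

```python
def one_training_step(model, x, y, is_train=True):
    # forward pass: follow the architecture from input to loss
    a1 = model['L1'].forward(x)                  # linear(1)
    h1 = model['nonlinear1'].forward(a1)         # tanh or relu
    d1 = model['drop1'].forward(h1, is_train)    # dropout (only active at train time)
    a2 = model['L2'].forward(d1)                 # linear(2)
    loss = model['loss'].forward(a2, y)          # softmax + cross-entropy

    # backward pass: same modules, reverse order
    grad_a2 = model['loss'].backward(a2, y)
    grad_d1 = model['L2'].backward(d1, grad_a2)
    grad_h1 = model['drop1'].backward(h1, grad_d1)
    grad_a1 = model['nonlinear1'].backward(a1, grad_h1)
    _ = model['L1'].backward(x, grad_a1)         # fills the gradients used by SGD
    return loss
```

After the backward pass, a call to miniBatchStochasticGradientDescent applies the parameter update from Q1.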