EN.601.682-Deep Learning Homework 3 Solved

1.    We have talked about backpropagation in class, and here is supplementary material on calculating the gradient for backpropagation (https://piazza.com/class_profile/get_resource/jxcftju833c25t/k0labsf3cny4qw). Please study this material carefully before you start this exercise. Suppose P = WX and L = f(P), where f is a loss function.

(a)     Please show that ∂L/∂W = (∂L/∂P) Xᵀ. Show each step of your derivation.

(b)     Suppose the loss function is the L2 loss, defined as L(y, ŷ) = ‖y − ŷ‖², where y is the ground truth and ŷ is the prediction. Given the following initialization of W and X, please calculate the updated W after one iteration (step size = 0.1).

 
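As a sanity check for problem 1, here is a minimal NumPy sketch, assuming hypothetical values for W, X, and the target y (the handout's actual initialization is not reproduced above). It verifies the identity ∂L/∂W = (∂L/∂P) Xᵀ against a finite-difference estimate and then takes one gradient-descent step on the L2 loss with step size 0.1.

import numpy as np

# Hypothetical initialization (the assignment's actual W, X, and target y are not shown above).
W = np.array([[0.5, -0.3],
              [0.2,  0.8]])
X = np.array([[1.0,  2.0],
              [0.5, -1.0]])
y = np.array([[1.0, 0.0],
              [0.0, 1.0]])          # ground truth for the L2 loss

def loss(W):
    P = W @ X                        # P = WX
    return np.sum((y - P) ** 2)      # L(y, P) = ||y - P||^2

# Analytic gradient via the chain rule: dL/dP = -2(y - P), dL/dW = (dL/dP) X^T
P = W @ X
dL_dP = -2.0 * (y - P)
dL_dW = dL_dP @ X.T

# Finite-difference check of dL/dW
num_grad = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num_grad[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
print("max abs difference:", np.abs(dL_dW - num_grad).max())

# One gradient-descent iteration with step size 0.1
W_new = W - 0.1 * dL_dW
print("updated W:\n", W_new)

The same two lines at the end (analytic gradient, then the update W − 0.1 · ∂L/∂W) apply directly once the handout's actual W and X are plugged in.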

2.    In this exercise, we will explore how vanishing and exploding gradients affect the learning process. Consider a simple, 1-dimensional, 3-layer network with data x ∈ R, prediction ŷ ∈ [0,1], true label y ∈ {0,1}, and weights w1, w2, w3 ∈ R, where each weight is initialized randomly from N(0,1). We will use the sigmoid activation function σ between all layers, and the cross-entropy loss function L(y, ŷ) = −(y log(ŷ) + (1 − y) log(1 − ŷ)). This network can be represented as ŷ = σ(w3 · σ(w2 · σ(w1 · x))). Note that for this problem, we are not including a bias term.

(a)     Compute the derivative of the sigmoid. What are the extreme values of this derivative, and where are they attained?

(b)     Consider a random initialization of w1 = 0.25, w2 = −0.11, w3 = 0.78, and a sample from the data set (x = 0.63, y = 1). Using backpropagation, compute the gradients for each weight. What do you notice about the magnitude of the gradients?
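A minimal sketch (not part of the original handout) that computes these gradients with manual backpropagation at the given initialization. It relies on the standard facts that σ′(z) = σ(z)(1 − σ(z)), which attains its maximum value of 1/4 at z = 0, and that for a sigmoid output with cross-entropy loss the gradient with respect to the pre-activation simplifies to ŷ − y.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Given initialization and data sample
w1, w2, w3 = 0.25, -0.11, 0.78
x, y = 0.63, 1.0

# Forward pass: y_hat = sigmoid(w3 * sigmoid(w2 * sigmoid(w1 * x)))
a1 = sigmoid(w1 * x)
a2 = sigmoid(w2 * a1)
y_hat = sigmoid(w3 * a2)

# Backward pass for cross-entropy loss L = -(y log y_hat + (1 - y) log(1 - y_hat)).
# With a sigmoid output, dL/dz3 simplifies to (y_hat - y), where z3 = w3 * a2.
dz3 = y_hat - y
dw3 = dz3 * a2
dz2 = dz3 * w3 * a2 * (1 - a2)      # sigma'(z) = sigma(z)(1 - sigma(z)), at most 1/4
dw2 = dz2 * a1
dz1 = dz2 * w2 * a1 * (1 - a1)
dw1 = dz1 * x

print("dL/dw3 =", dw3)
print("dL/dw2 =", dw2)
print("dL/dw1 =", dw1)   # each extra sigmoid-derivative factor shrinks the gradient further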

Now consider that we want to switch to a regression task and use a similar network structure: we remove the final sigmoid activation, so the new network is defined as ŷ = w3 · σ(w2 · σ(w1 · x)), where predictions ŷ ∈ R and targets y ∈ R, and we use the L2 loss function instead of cross entropy: L(y, ŷ) = (y − ŷ)². Derive the gradient of the loss function with respect to each of the weights w1, w2, w3.

(c)      Consider again the random initialization of w1 = 0.25, w2 = −0.11, w3 = 0.78, and a sample from the data set (x = 0.63, y = 128). Using backpropagation, compute the gradients for each weight. What do you notice about the magnitude of the gradients?
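A companion sketch, again not from the handout, for the regression variant: the same manual backpropagation as above, but with the final sigmoid removed and the L2 loss L = (y − ŷ)², so ∂L/∂ŷ = −2(y − ŷ). With y = 128 the error term is large, which is what makes the contrast with part (b) worth examining.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Given initialization and data sample for the regression setup
w1, w2, w3 = 0.25, -0.11, 0.78
x, y = 0.63, 128.0

# Forward pass: y_hat = w3 * sigmoid(w2 * sigmoid(w1 * x))
a1 = sigmoid(w1 * x)
a2 = sigmoid(w2 * a1)
y_hat = w3 * a2

# Backward pass for L = (y - y_hat)^2
dy_hat = -2.0 * (y - y_hat)
dw3 = dy_hat * a2
dz2 = dy_hat * w3 * a2 * (1 - a2)
dw2 = dz2 * a1
dz1 = dz2 * w2 * a1 * (1 - a1)
dw1 = dz1 * x

print("dL/dw3 =", dw3)
print("dL/dw2 =", dw2)
print("dL/dw1 =", dw1)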
