Question 1 (4-4-4). In this question you will demonstrate that an estimate of the first moment of the gradient using an (exponential) running average is equivalent to using momentum, and is biased by a scaling factor. The goal of this question is for you to consider the relationship between different optimization schemes, and to practice noting and quantifying the effect (particularly in terms of bias/variance) of estimating a quantity.
Let $g_t$ be an unbiased sample of the gradient at time step $t$ and $\Delta\theta_t$ be the update to be made. Initialize $v_0$ to be a vector of zeros.
1. For t ≥ 1, consider the following update rules:
• SGD with momentum:
$$v_t = \alpha v_{t-1} + \epsilon g_t, \qquad \Delta\theta_t = -v_t$$
where $\epsilon > 0$ and $\alpha \in (0,1)$.
• SGD with a running average of $g_t$:
$$v_t = \beta v_{t-1} + (1-\beta) g_t, \qquad \Delta\theta_t = -\delta v_t$$
where $\beta \in (0,1)$ and $\delta > 0$.
Express the two update rules recursively ($\Delta\theta_t$ as a function of $\Delta\theta_{t-1}$). Show that these two update rules are equivalent, i.e. express $(\alpha, \epsilon)$ as a function of $(\beta, \delta)$. (A small numerical checker for this correspondence is sketched after this question.)
2. Unroll the running average update rule, i.e. express $v_t$ as a linear combination of the $g_i$'s ($1 \le i \le t$).
3. Assume $g_t$ has a stationary distribution independent of $t$. Show that the running average is biased, i.e. $\mathbb{E}[v_t] \neq \mathbb{E}[g_t]$. Propose a way to eliminate this bias by rescaling $v_t$, so that we obtain an unbiased estimate of the running average.
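As an illustration (not part of the graded question), the sketch below is a small NumPy checker that runs the two update rules above on the same stream of noisy gradients; the gradient values and the parameter settings in the usage example are arbitrary placeholders, so a proposed correspondence between $(\alpha, \epsilon)$ and $(\beta, \delta)$ can be tested numerically.

```python
# Illustrative sketch: compare the momentum update -v_t with the
# running-average update -delta * v_t on a shared stream of noisy gradients.
import numpy as np

def max_update_gap(alpha, eps, beta, delta, steps=100, seed=0):
    """Largest difference between the two updates over `steps` iterations."""
    rng = np.random.default_rng(seed)
    v_mom = np.zeros(3)
    v_avg = np.zeros(3)
    gap = 0.0
    for _ in range(steps):
        g = np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(3)  # unbiased noisy gradient
        v_mom = alpha * v_mom + eps * g        # SGD with momentum
        v_avg = beta * v_avg + (1 - beta) * g  # exponential running average
        gap = max(gap, np.max(np.abs(-v_mom - (-delta * v_avg))))
    return gap

# Example usage with arbitrary (hypothetical) settings; plug in the
# correspondence you derive in part 1 and the gap should be ~0.
print(max_update_gap(alpha=0.5, eps=0.05, beta=0.9, delta=0.1))
```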
Question 2 (7-5-5-3). The point of this question is to understand and compare the effects of different regularizers (specifically dropout and weight decay) on the weights of a network. Consider a linear regression problem with input data $X \in \mathbb{R}^{n \times d}$, weights $w \in \mathbb{R}^{d \times 1}$ and targets $y \in \mathbb{R}^{n \times 1}$. Suppose that dropout is applied to the input (with probability $1-p$ of dropping a unit, i.e. setting it to $0$). Let $R \in \mathbb{R}^{n \times d}$ be the dropout mask such that $R_{ij} \sim \mathrm{Bern}(p)$ is sampled i.i.d. from the Bernoulli distribution.
For a squared error loss function with dropout, we then have:
$$L(w) = \|y - (R \odot X) w\|^2,$$
where $\odot$ denotes the elementwise (Hadamard) product.
1. Let $\Gamma$ be a diagonal matrix with $\Gamma_{ii} = \|X_{:,i}\|_2$ (the $\ell_2$ norm of the $i$-th column of $X$). Show that the expectation (over $R$) of the loss function can be rewritten as $\mathbb{E}[L(w)] = \|y - pXw\|^2 + p(1-p)\|\Gamma w\|^2$. Hint: note that we are taking the expectation of a squared term; use $\mathrm{Var}(Z) = \mathbb{E}[Z^2] - \mathbb{E}[Z]^2$. (A Monte Carlo check of this identity is sketched after this question.)
2. Show that the solution $w_{\text{dropout}}$ that minimizes the expected loss from question 2.1 satisfies
$$p\, w_{\text{dropout}} = \big(X^\top X + \lambda_{\text{dropout}} \Gamma^2\big)^{-1} X^\top y,$$
where $\lambda_{\text{dropout}}$ is a regularization coefficient depending on $p$. How does the value of $p$ affect the regularization coefficient $\lambda_{\text{dropout}}$?
3. Express the loss function for a linear regression problem without dropout and with $L_2$ regularization, with regularization coefficient $\lambda_{L_2}$. Derive its closed-form solution $w_{L_2}$.
4. Compare the results of 2.2 and 2.3: identify specific differences in the equations you arrive at, and discuss qualitatively what the equations tell you about the similarities and differences in the effects of weight decay and dropout (1-3 sentences).
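The identity in part 2.1 can be sanity-checked numerically. The sketch below is illustrative only; it uses a Monte Carlo estimate of the expected dropout loss and compares it with the closed form, with $\Gamma$ as defined above (the data, weights and $p$ are arbitrary placeholders).

```python
# Illustrative sketch: Monte Carlo estimate of E_R[ ||y - (R*X) w||^2 ]
# versus the closed form ||y - p X w||^2 + p (1-p) ||Gamma w||^2.
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 200, 5, 0.8
X = rng.standard_normal((n, d))
w = rng.standard_normal(d)
y = X @ w + 0.1 * rng.standard_normal(n)

Gamma = np.diag(np.linalg.norm(X, axis=0))          # Gamma_ii = ||X[:, i]||_2
closed_form = np.sum((y - p * X @ w) ** 2) + p * (1 - p) * np.sum((Gamma @ w) ** 2)

mc = 0.0
n_samples = 5000
for _ in range(n_samples):
    R = rng.random((n, d)) < p                      # R_ij ~ Bern(p): keep with prob. p
    mc += np.sum((y - (R * X) @ w) ** 2)
mc /= n_samples

print("Monte Carlo:", mc, " closed form:", closed_form)  # should agree closely
```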
Question 3 (6-10-2). The goal of this question is for you to understand the reasoning behind different parameter initializations for deep networks, particularly to think about the ways that the initialization affects the activations (and therefore the gradients) of the network. Consider the following equation for the t-th layer of a deep network:
$$a^{(t)} = W^{(t)} h^{(t-1)} + b^{(t)}, \qquad h^{(t)} = g\big(a^{(t)}\big)$$
where $a^{(t)}$ are the pre-activations and $h^{(t)}$ are the activations for layer $t$, $g$ is an activation function, $W^{(t)}$ is a $d^{(t)} \times d^{(t-1)}$ matrix, and $b^{(t)}$ is a $d^{(t)} \times 1$ bias vector. The bias is initialized as a constant vector $b^{(t)} = [c, \ldots, c]$ for some $c \in \mathbb{R}$, and the entries of the weight matrix are initialized by sampling i.i.d. from a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$.
Your task is to design an initialization scheme that achieves a vector of pre-activations at layer $t$ whose elements are zero-mean and unit-variance (i.e. $\mathbb{E}[a^{(t)}_i] = 0$ and $\mathrm{Var}(a^{(t)}_i) = 1$ for $1 \le i \le d^{(t)}$), under the assumptions about either the activations or pre-activations of layer $t-1$ listed below. Note that we are not asking for a general formula; you just need to provide one setting that meets these criteria (there are many possibilities). (A Monte Carlo checker for candidate settings is sketched after this question.)
1. First assume that the activations of the previous layer satisfy $\mathbb{E}[h^{(t-1)}_i] = 0$ and $\mathrm{Var}(h^{(t-1)}_i) = 1$ for $1 \le i \le d^{(t-1)}$. Also assume the entries of $h^{(t-1)}$ are uncorrelated (the answer should not depend on $g$).
(a) Show that $\mathrm{Var}(XY) = \mathrm{Var}(X)\mathrm{Var}(Y) + \mathrm{Var}(X)\mathbb{E}[Y]^2 + \mathrm{Var}(Y)\mathbb{E}[X]^2$ when $X \perp Y$.
(b) Write $\mathbb{E}[a^{(t)}_i]$ and $\mathrm{Var}(a^{(t)}_i)$ in terms of $c$, $\mu$, $\sigma^2$, and $d^{(t-1)}$.
(c) Give values for $c$, $\mu$, and $\sigma^2$ as a function of $d^{(t-1)}$ such that $\mathbb{E}[a^{(t)}_i] = 0$ and $\mathrm{Var}(a^{(t)}_i) = 1$ for $1 \le i \le d^{(t)}$.
2. Now assume that the pre-activations of the previous layer satisfy $\mathbb{E}[a^{(t-1)}_i] = 0$, $\mathrm{Var}(a^{(t-1)}_i) = 1$, and that $a^{(t-1)}_i$ has a symmetric distribution, for $1 \le i \le d^{(t-1)}$. Assume the entries of $a^{(t-1)}$ are uncorrelated. Consider the case of the ReLU activation: $g(x) = \max\{0, x\}$.
(a) Derive $\mathbb{E}\big[(h^{(t-1)}_i)^2\big]$ under these assumptions.
(b) Using the result from (a), give values for $c$, $\mu$, and $\sigma^2$ as a function of $d^{(t-1)}$ such that $\mathbb{E}[a^{(t)}_i] = 0$ and $\mathrm{Var}(a^{(t)}_i) = 1$ for $1 \le i \le d^{(t)}$.
(c) What popular initialization scheme has this form?
(d) Why do you think this initialization would work well in practice? Answer in 1-2 sentences.
3. For both assumptions (1, 2), give values for $c$, $\mu$, and $\sigma^2$ such that $\mathbb{E}[a^{(t)}_i] = 0$ and $\mathrm{Var}(a^{(t)}_i) = 1$ for $1 \le i \le d^{(t)}$.
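Since the question only asks for one valid setting, a quick empirical sanity check can be useful. The sketch below (illustrative only) estimates the mean and variance of the layer-$t$ pre-activations under assumption 1, using a standard normal $h^{(t-1)}$ as one distribution that satisfies that assumption; the candidate $(c, \mu, \sigma^2)$ in the usage line is a hypothetical placeholder, not an answer.

```python
# Illustrative sketch: Monte Carlo checker for a candidate (c, mu, sigma2)
# initialization, measuring E[a_i^{(t)}] and Var(a_i^{(t)}) under assumption 1.
import numpy as np

def preact_stats(c, mu, sigma2, d_prev=128, d_t=128, n_trials=1000, seed=0):
    """Estimate mean and variance of a^{(t)} = W h^{(t-1)} + b entries."""
    rng = np.random.default_rng(seed)
    a = np.empty((n_trials, d_t))
    for k in range(n_trials):
        h_prev = rng.standard_normal(d_prev)                 # zero-mean, unit-variance, uncorrelated
        W = mu + np.sqrt(sigma2) * rng.standard_normal((d_t, d_prev))
        a[k] = W @ h_prev + c                                # constant bias vector [c, ..., c]
    return a.mean(), a.var()

# Example usage with a hypothetical candidate setting (plug in your own):
mean_a, var_a = preact_stats(c=0.1, mu=0.0, sigma2=0.01)
print(f"E[a] ~ {mean_a:.3f}, Var(a) ~ {var_a:.3f}")
```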
Question 4 (4-6-6). This question is about normalization techniques.
1. Batch normalization, layer normalization and instance normalization all involve calculating the mean $\mu$ and variance $\sigma^2$ with respect to different subsets of the tensor dimensions. Given the following 3D tensor, calculate the corresponding mean and variance tensors for each normalization technique: $\mu_{\text{batch}}$, $\mu_{\text{layer}}$, $\mu_{\text{instance}}$, $\sigma^2_{\text{batch}}$, $\sigma^2_{\text{layer}}$, and $\sigma^2_{\text{instance}}$. The size of this tensor is $4 \times 2 \times 3$, which corresponds to the batch size, number of channels, and number of features respectively. (A sketch of the axes involved in each case appears after this question.)
2. For the next two subquestions, we consider the following parameterization of a weight vector $w$:
$$w := \gamma \frac{u}{\|u\|}$$
where $\gamma$ is a scalar parameter controlling the magnitude and $u$ is a vector controlling the direction of $w$.
Consider one layer of a neural network, and omit the bias parameter. To carry out batch normalization, one normally standardizes the pre-activation $y = u^\top x$ and performs an elementwise scale and shift,
$$\hat{y} = \gamma\, \frac{y - \mathbb{E}[y]}{\sqrt{\mathrm{Var}(y)}} + \beta.$$
Assume the data $x$ (a random vector) is whitened ($\mathrm{Var}(x) = I$) and centered at $0$ ($\mathbb{E}[x] = 0$). Show that $\hat{y} = w^\top x + \beta$.
3. Show that the gradient of a loss function $L(u, \gamma, \beta)$ with respect to $u$ can be written in the form
$$\nabla_u L = s\, W^{\perp} \nabla_w L$$
for some scalar $s$, where $W^{\perp} = I - \frac{w w^\top}{\|w\|^2}$. Note that[1] $W^{\perp} u = 0$.
[1] As a side note: $W^{\perp}$ is the projection onto the orthogonal complement of $w$; it projects the gradient away from the direction of $w$, which is usually (empirically) close to a dominant eigenvector of the covariance of the gradient. This helps to condition the landscape of the objective that we want to optimize.
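For reference, the sketch below illustrates this question numerically. The axis conventions in part 1 are the common ones for a (batch, channels, features) tensor and are stated here as an assumption (check them against the course's definitions); part 2 uses the population statistics $\mathbb{E}[y] = 0$ and $\mathrm{Var}(y) = \|u\|^2$ that follow from whitened, centered $x$.

```python
# Illustrative sketch for this question (axis conventions are assumptions).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2, 3))        # (batch=4, channels=2, features=3)

# Part 1: which axes are reduced for each normalization scheme.
mu_batch,    var_batch    = x.mean(axis=(0, 2)), x.var(axis=(0, 2))   # per channel
mu_layer,    var_layer    = x.mean(axis=(1, 2)), x.var(axis=(1, 2))   # per example
mu_instance, var_instance = x.mean(axis=2),      x.var(axis=2)        # per (example, channel)
print(mu_batch.shape, mu_layer.shape, mu_instance.shape)   # (2,), (4,), (4, 2)

# Part 2: weight normalization w = gamma * u / ||u||. With E[x] = 0 and
# Var(x) = I, the pre-activation y = u^T x has E[y] = 0 and Var(y) = ||u||^2,
# so standardizing y and applying the scale/shift gives exactly w^T x + beta.
d = 5
u = rng.standard_normal(d)
gamma, beta = 1.7, -0.3
w = gamma * u / np.linalg.norm(u)

X = rng.standard_normal((8, d))                          # a few sample inputs
y = X @ u                                                # pre-activations
y_hat = gamma * (y - 0.0) / np.linalg.norm(u) + beta     # standardize with E[y], sqrt(Var(y))
print(np.allclose(y_hat, X @ w + beta))                  # True
```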
Question 5 (4-6-4). This question is about activation functions and vanishing/exploding gradients in recurrent neural networks (RNNs). Let $\sigma : \mathbb{R} \to \mathbb{R}$ be an activation function. When the argument is a vector, we apply $\sigma$ element-wise. Consider the following recurrent unit:
$$h_t = W \sigma(h_{t-1}) + U x_t + b$$
1. Show that applying the activation function in this way is equivalent to the conventional way of applying it, $g_t = \sigma(W g_{t-1} + U x_t + b)$; i.e. express $g_t$ in terms of $h_t$. More formally, prove it using mathematical induction. You only need to prove the induction step, assuming your expression holds for time step $t-1$.
*2. Let $\|A\|$ denote the $L_2$ operator norm[1] of a matrix $A$ ($\|A\| := \max_{x:\|x\|=1} \|Ax\|$). Assume $\sigma(x)$ has bounded derivative, i.e. $|\sigma'(x)| \le \gamma$ for some $\gamma > 0$ and for all $x$. We denote by $\lambda_1(\cdot)$ the largest eigenvalue of a symmetric matrix. Show that if the largest eigenvalue of the weights is bounded as $\lambda_1(W^\top W) \le \delta^2/\gamma^2$ for some $0 \le \delta < 1$, gradients of the hidden state will vanish over time, i.e.
$$\left\| \frac{\partial h_T}{\partial h_t} \right\| \to 0 \quad \text{as } T \to \infty.$$
(A numerical illustration of this behaviour is sketched after this question.)
Use the following properties of the L2 operator norm
$$\|AB\| \le \|A\|\,\|B\| \qquad \text{and} \qquad \|A\| = \sqrt{\lambda_1(A^\top A)}$$
3. What do you think will happen to the gradients of the hidden state if the condition in the previous question is reversed, i.e. if the largest eigenvalue of the weights is larger than $1/\gamma^2$? Is this condition necessary or sufficient for the gradient to explode? (Answer in 1-2 sentences.)
[1] The $L_2$ operator norm of a matrix $A$ is an induced norm corresponding to the $L_2$ norm of vectors. You can try to prove the given properties as an exercise.
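The vanishing-gradient claim in part 2 can be visualized numerically. Below is a minimal sketch assuming tanh as a concrete activation (so $|\sigma'| \le \gamma = 1$) and a weight matrix rescaled so its largest singular value equals $\delta/\gamma$ with $\delta < 1$; the dimensions and inputs are arbitrary.

```python
# Illustrative sketch: the Jacobian d h_T / d h_1 for h_t = W sigma(h_{t-1}) + U x_t + b
# shrinks roughly like delta^T when ||W||_2 <= delta / gamma with delta < 1.
import numpy as np

rng = np.random.default_rng(0)
d, T = 6, 60
delta, gamma = 0.9, 1.0                        # tanh has |sigma'| <= 1 = gamma

W = rng.standard_normal((d, d))
W *= delta / (gamma * np.linalg.norm(W, 2))    # rescale so ||W||_2 = delta / gamma
U, b = rng.standard_normal((d, d)), rng.standard_normal(d)

h = rng.standard_normal(d)
J = np.eye(d)                                  # accumulates d h_t / d h_1
for t in range(1, T + 1):
    x_t = rng.standard_normal(d)
    # d h_t / d h_{t-1} = W diag(sigma'(h_{t-1})), with sigma = tanh here
    J = (W * (1.0 - np.tanh(h) ** 2)) @ J
    h = W @ np.tanh(h) + U @ x_t + b
    if t % 20 == 0:
        print(t, np.linalg.norm(J, 2))          # decays (bounded by delta**t)
```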
Question 6 (4-8-8). Consider the following bidirectional RNN:
$$h^{(f)}_t = \sigma\big(W^{(f)} h^{(f)}_{t-1} + U^{(f)} x_t + b^{(f)}\big)$$
$$h^{(b)}_t = \sigma\big(W^{(b)} h^{(b)}_{t+1} + U^{(b)} x_t + b^{(b)}\big)$$
$$y_t = V^{(f)} h^{(f)}_t + V^{(b)} h^{(b)}_t + c$$
where the superscripts $f$ and $b$ correspond to the forward and backward RNNs respectively and $\sigma$ denotes the logistic sigmoid function. Let $z_t$ be the true target of the prediction $y_t$ and consider the sum of squared loss $L = \sum_t L_t$ where $L_t = \|y_t - z_t\|^2$.
In this question our goal is to obtain an expression for the gradients $\nabla_{W^{(f)}} L$ and $\nabla_{U^{(b)}} L$. (A forward-pass sketch of this setup appears after this question.)
1. First, complete the following computational graph for this RNN, unrolled for 3 time steps (from t = 1 to t = 3). Label each node with the corresponding hidden unit and each edge with the corresponding weight. Note that it includes the initial hidden states for both the forward and backward RNNs.
Figure 1 – Computational graph of the bidirectional RNN unrolled for three timesteps.
2. Using total derivatives, we can express the gradients $\nabla_{h^{(f)}_t} L$ and $\nabla_{h^{(b)}_t} L$ recursively in terms of $\nabla_{h^{(f)}_{t+1}} L$ and $\nabla_{h^{(b)}_{t-1}} L$ as follows:
$$\nabla_{h^{(f)}_t} L = \nabla_{h^{(f)}_t} L_t + \left(\frac{\partial h^{(f)}_{t+1}}{\partial h^{(f)}_t}\right)^{\!\top} \nabla_{h^{(f)}_{t+1}} L$$
$$\nabla_{h^{(b)}_t} L = \nabla_{h^{(b)}_t} L_t + \left(\frac{\partial h^{(b)}_{t-1}}{\partial h^{(b)}_t}\right)^{\!\top} \nabla_{h^{(b)}_{t-1}} L$$
Derive an expression for $\nabla_{h^{(f)}_t} L_t$, $\nabla_{h^{(b)}_t} L_t$, $\frac{\partial h^{(f)}_{t+1}}{\partial h^{(f)}_t}$, and $\frac{\partial h^{(b)}_{t-1}}{\partial h^{(b)}_t}$.
3. Now derive $\nabla_{W^{(f)}} L$ and $\nabla_{U^{(b)}} L$ as functions of $\nabla_{h^{(f)}_t} L$ and $\nabla_{h^{(b)}_t} L$, respectively.
Hint: it might be useful to consider the contribution of the weight matrices when computing the recurrent hidden unit at a particular time $t$, and how those contributions accumulate across all time steps.
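For concreteness, here is a minimal forward-pass sketch of this bidirectional RNN unrolled for three time steps. It follows the equations written above, which are themselves a reconstruction, so the output layer ($V^{(f)}$, $V^{(b)}$) and the exact loss form should be treated as assumptions; all weights and inputs are random placeholders.

```python
# Illustrative sketch: forward pass of the bidirectional RNN, unrolled for T = 3 steps.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
T, d_in, d_h, d_out = 3, 4, 5, 2
x = rng.standard_normal((T, d_in))
z = rng.standard_normal((T, d_out))                      # targets z_t

Wf, Uf, bf = [rng.standard_normal(s) for s in ((d_h, d_h), (d_h, d_in), (d_h,))]
Wb, Ub, bb = [rng.standard_normal(s) for s in ((d_h, d_h), (d_h, d_in), (d_h,))]
Vf, Vb = rng.standard_normal((d_out, d_h)), rng.standard_normal((d_out, d_h))

hf = np.zeros((T + 1, d_h))                              # hf[t] = h_t^{(f)}; hf[0] is the initial state
hb = np.zeros((T + 2, d_h))                              # hb[t] = h_t^{(b)}; hb[T+1] is the initial state
for t in range(1, T + 1):                                # forward RNN: left to right
    hf[t] = sigmoid(Wf @ hf[t - 1] + Uf @ x[t - 1] + bf)
for t in range(T, 0, -1):                                # backward RNN: right to left
    hb[t] = sigmoid(Wb @ hb[t + 1] + Ub @ x[t - 1] + bb)

y = np.stack([Vf @ hf[t] + Vb @ hb[t] for t in range(1, T + 1)])
L = np.sum((y - z) ** 2)                                 # L = sum_t ||y_t - z_t||^2
print(L)
```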