CS7643 Homework 2 Solution

TAs: Sameer Dharur, Joanne Truong, Yihao Chen, Michael Piseno, Hrishikesh Kale, Tianyu Zhan, Pravhav Chawla, Guillermo Nicolas Grande.
Discussions: https://piazza.com/gatech/fall2020/cs48037643
Instructions
1. We will be using Gradescope to collect your assignments. Please read the following instructions for submitting to Gradescope carefully!
• For the HW2 component on Gradescope, you should upload a single PDF containing the answers to all the theory questions and the completed Jupyter notebooks for the coding problems. However, the solution to each problem or subproblem must be on a separate page. When submitting to Gradescope, please make sure to mark the page(s) corresponding to each problem/sub-problem. Likewise, the pages of the Jupyter notebooks must also be matched to their corresponding subproblems.
• For the HW2 Code component on Gradescope, please use the collect_submission.sh script provided and upload the resulting hw2_code.zip here. Please make sure you have saved the most recent version of your Jupyter notebook before running this script.
2. LaTeX'd solutions are strongly encouraged (solution template available at cc.gatech.edu/classes/AY2021/cs7643_fall/assets/sol2.tex), but scanned handwritten copies are acceptable. Hard copies are not accepted.
3. We generally encourage you to collaborate with other students.
1 Gradient Descent
1. (3 points) We often use iterative optimization algorithms such as Gradient Descent to find the w that minimizes a loss function f(w). Recall that in gradient descent, we start with an initial value of w (say w^(1)) and iteratively take a step in the direction of the negative of the gradient of the objective function, i.e.,

w^(t+1) = w^(t) − η∇f(w^(t))    (1)
for learning rate η > 0.
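As a concrete illustration of this update rule, here is a minimal sketch of gradient descent on a toy convex quadratic (the objective, step size, and iteration count below are illustrative choices, not part of the assignment):

```python
import numpy as np

def gradient_descent(grad_f, w0, eta=0.1, num_steps=100):
    """Iterate w <- w - eta * grad_f(w), as in Eqn (1)."""
    w = w0
    for _ in range(num_steps):
        w = w - eta * grad_f(w)
    return w

# Toy convex objective f(w) = ||w - 3||^2, minimized at w = (3, 3).
grad_f = lambda w: 2.0 * (w - 3.0)
print(gradient_descent(grad_f, w0=np.zeros(2)))  # approaches [3. 3.]
```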
In this question, we will develop a slightly deeper understanding of this update rule, in particular for minimizing a convex function f(w). Note: this analysis will not directly carry over to training neural networks since loss functions for training neural networks are typically not convex, but this will (a) develop intuition and (b) provide a starting point for research in non-convex optimization (which is beyond the scope of this class).
Recall the first-order Taylor approximation of f at w^(t):

f(w) ≈ f(w^(t)) + ⟨w − w^(t), ∇f(w^(t))⟩    (2)

When f is convex, this approximation forms a lower bound of f, i.e.,

f(w) ≥ f(w^(t)) + ⟨w − w^(t), ∇f(w^(t))⟩  for all w,    (3)

where the right-hand side is an affine lower bound to f(·).
Since this approximation is a ‘simpler’ function than f(·), we could consider minimizing the approximation instead of f(·). Two immediate problems: (1) the approximation is affine (thus unbounded from below) and (2) the approximation is only faithful for w close to w^(t). To solve both problems, we add a squared ℓ2 proximity term to the approximation minimization:
argmin_w  f(w^(t)) + ⟨w − w^(t), ∇f(w^(t))⟩ + (λ/2) ||w − w^(t)||²    (4)

where the first two terms form the affine lower bound to f(·), ||w − w^(t)||² is the proximity term, and λ > 0 trades off between the two.
Notice that the optimization problem above is an unconstrained quadratic programming problem, meaning that it can be solved in closed form (hint: gradients).
What is the solution w∗ of the above optimization? What does that tell you about the gradient descent update rule? What is the relationship between λ and η?
2. (4 points) Let’s prove a lemma that will initially seem disconnected from the rest of the analysis, but will come in handy in the next sub-question when we start combining things. Specifically, the analysis in this sub-question holds for any w⋆, but in the next sub-question we will use it for the w⋆ that minimizes f(w).
Consider a sequence of vectors v_1, v_2, ..., v_T, and an update equation of the form w^(t+1) = w^(t) − η v_t with w^(1) = 0. Show that:

Σ_{t=1}^{T} ⟨w^(t) − w⋆, v_t⟩ ≤ ||w⋆||² / (2η) + (η/2) Σ_{t=1}^{T} ||v_t||²    (5)
3. (4 points) Now let’s start putting things together and analyze the convergence rate of gradient descent, i.e., how fast it converges to w⋆.

First, show that for the average iterate w̄ = (1/T) Σ_{t=1}^{T} w^(t),

f(w̄) − f(w⋆) ≤ (1/T) Σ_{t=1}^{T} ⟨w^(t) − w⋆, ∇f(w^(t))⟩    (6)

Next, use the result from part 2, with upper bounds B and ρ for ||w⋆|| and ||∇f(w^(t))|| respectively, and show that for an appropriately chosen fixed η (depending on B, ρ, and T), the convergence rate of gradient descent is O(1/√T), i.e., the upper bound on f(w̄) − f(w⋆) scales as 1/√T.
2 Estimating Hessians [Extra credit for 4803 and 7643]
4. (6 points) Optimization is an extremely important part of deep learning. In the previous question, we explored gradient descent, which follows the direction of steepest descent to minimize a loss function. However, gradient descent leaves a few questions unresolved – how do we choose the learning rate η? If η is small, we will take a long time to reach the optimal point; if η is large, the iterates may oscillate from one side of the curve to the other. So what should we do?
One solution is to use the Hessian, which is a measure of curvature, or the rate of change of the gradients. Intuitively, if we knew how curved the loss surface were, we would know how fast we should move in a given direction. This is the intuition behind second-order optimization methods such as Newton’s method.
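As a toy illustration of using curvature (a sketch with an illustrative one-dimensional objective, where the Hessian reduces to the second derivative f''(w)):

```python
def newton_step(w, grad, hess):
    """One Newton update: rescale the gradient by the inverse curvature."""
    return w - grad(w) / hess(w)

# Illustrative objective f(w) = w^4, with f'(w) = 4w^3 and f''(w) = 12w^2.
grad = lambda w: 4 * w**3
hess = lambda w: 12 * w**2

w = 1.0
for _ in range(10):
    w = newton_step(w, grad, hess)
print(w)  # each step multiplies w by 2/3, heading toward the minimizer 0
```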
Let us formally define the Hessian matrix H of a function f as the square n×n matrix containing all second partial derivatives of f, i.e.,

H_ij = ∂²f / (∂w_i ∂w_j),    for i, j = 1, ..., n.

Recall the second-order Taylor approximation of f at w^(t):

f(w) ≈ f(w^(t)) + ⟨w − w^(t), ∇f(w^(t))⟩ + (1/2) ⟨w − w^(t), H(w − w^(t))⟩    (7)
(a) What is the solution to the following optimization problem?

argmin_w  f(w^(t)) + ⟨w − w^(t), ∇f(w^(t))⟩ + (1/2) ⟨w − w^(t), H(w − w^(t))⟩    (8)
What does that tell you about how to set the learning rate η in gradient descent?
Now that we’ve derived Newton’s update rule, we should also mention that there is a catch to using Newton’s method: it requires us to 1) calculate H, and 2) invert H. Computing the Hessian is expensive; H is massive, and we would also have to figure out how to store it.
(b) Consider an MLP with 3 fully-connected layers, each with 50 hidden neurons, except for the output layer, which represents 10 classes. We can represent the transformations as x ∈ R50 −→ h(1) ∈ R50 −→ h(2) ∈ R50 −→ s ∈ R10. Assume that x does not include any bias feature appended to it. How many parameters are in this MLP? What is the size of the corresponding Hessian?
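If you want to sanity-check your count, a generic tally such as the following may help (an illustrative sketch; the helper and layer-size list are my own, not part of the assignment):

```python
def count_mlp_params(layer_sizes):
    """Total weights and biases for a fully-connected stack.

    layer_sizes lists each layer's width, input first,
    e.g. [input_dim, hidden_1, ..., output_dim].
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total

n = count_mlp_params([50, 50, 50, 10])  # the MLP shape from this question
print(n, "parameters; the corresponding Hessian is", n, "x", n)
```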
Rather than store and manipulate the Hessian H directly, we will instead focus on being able to compute the result of a Hessian-vector product Hv, where v is an arbitrary vector. Why? Because in many cases one does not need the full Hessian but only Hv. Computing Hv is a core building block for computing a number of quantities including H−1∇f (hint, hint). You will next show a surprising result that it is possible to ‘extract information from the Hessian’, specifically to compute the Hessian-vector product without ever explicitly calculating or storing the Hessian itself!
Consider the Taylor series expansion of the gradient operator about a point in weight space:
∇f(w + ∆w) = ∇f(w) + H∆w + O(||∆w||²)    (9)

where w is a point in weight space, ∆w is a perturbation of w, and ∇f(w + ∆w) is the gradient of f evaluated at the perturbed point w + ∆w. If you have difficulty understanding the expression above, consider starting from Eqn (2), replacing w − w^(t) with ∆w and f(·) with ∇f(·).

(c) Use Eqn (9) to derive a numerical approximation of Hv (in terms of ∇f).
Hint: Consider choosing ∆w = rv, where v is a vector and r is a small number.
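As a sanity check of whatever approximation you derive, you can compare it against a toy quadratic whose Hessian is known exactly (a sketch; the matrix, vectors, and step size r below are illustrative choices):

```python
import numpy as np

def approx_hvp(grad_f, w, v, r=1e-5):
    """Difference of gradients at two nearby points, scaled by 1/r."""
    return (grad_f(w + r * v) - grad_f(w)) / r

# Toy quadratic f(w) = 0.5 * w^T A w, whose Hessian is exactly A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_f = lambda w: A @ w

w = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])
print(approx_hvp(grad_f, w, v))  # should closely match A @ v
print(A @ v)
```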
Let’s now define a useful operator, known as the R-operator. The R-operator with respect to v is defined as:

R_v{f(w)} = ∂/∂r f(w + rv) |_{r=0}    (10)
(d) The R-operator has many useful properties. Let’s first prove some of them. Show that:
Rv{cf(w)} = cRv{f(w)} [Linearity under scalar multiplication]
Rv{f(w)g(w)} = Rv{f(w)}g(w) + f(w)Rv{g(w)} [Product rule for R-operators]
(e) Now, instead of numerically approximating Hv, use the R-operator to derive an equation to exactly calculate Hv.
(f) Explain how you might implement Hv for MLPs if you already have access to an auto-differentiation library.
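For intuition about part (f), here is a sketch of one widely used pattern in PyTorch: two backward passes, where the first keeps its graph alive via create_graph=True (the tiny loss below is an illustrative stand-in for an MLP’s loss):

```python
import torch

w = torch.randn(5, requires_grad=True)  # illustrative parameters
loss = (w ** 4).sum()                   # illustrative scalar loss
v = torch.randn(5)                      # arbitrary vector to multiply by

# First backward pass: gradient with a differentiable graph.
(grad,) = torch.autograd.grad(loss, w, create_graph=True)

# Second backward pass: d<grad, v>/dw equals Hv, since v is constant.
(hv,) = torch.autograd.grad(grad @ v, w)
print(hv)  # for this loss, equals 12 * w**2 * v elementwise
```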
3 Automatic Differentiation
5. (4 points) In practice, writing the closed-form expression of the derivative of a loss function f
w.r.t. the parameters of a deep neural network is hard (and mostly unnecessary) as f becomes complex. Instead, we define computation graphs and use the automatic differentiation algorithms (typically backpropagation) to compute gradients using the chain rule. For example, consider the expression
f(x, y) = (x + y)(y + 1)    (11)

Let’s define intermediate variables a and b such that

a = x + y    (12)
b = y + 1    (13)
f = a × b    (14)

A computation graph for the “forward pass” through f is shown in Fig. 1.

Figure 1: Computation graph for the forward pass through f.
We can then work backwards and compute the derivative of f w.r.t. each intermediate variable (∂f/∂a, ∂f/∂b) and chain them together to get ∂f/∂x and ∂f/∂y.
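To make the chaining concrete, here is the example above traced by hand in a few lines of Python (the input values are illustrative):

```python
x, y = 2.0, 3.0  # illustrative inputs

# Forward pass through the intermediate variables.
a = x + y   # Eqn (12)
b = y + 1   # Eqn (13)
f = a * b   # Eqn (14)

# Backward pass: local derivatives, chained from f back to the inputs.
df_da = b                          # ∂f/∂a
df_db = a                          # ∂f/∂b
df_dx = df_da * 1.0                # ∂a/∂x = 1
df_dy = df_da * 1.0 + df_db * 1.0  # y feeds both a and b
print(f, df_dx, df_dy)  # 20.0 4.0 9.0
```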
Let σ(·) denote the standard sigmoid function. Now, for the following vector function:
f1(w1, w2) = e^(e^(w1) + e^(2w2)) + sin(e^(w1) + e^(2w2))    (15)
f2(w1, w2) = w1·w2 + σ(w1)    (16)
(a) Draw the computation graph. Compute the value of f at w⃗ = (1, 2).
(b) At this w⃗, compute the Jacobian using numerical differentiation (using ∆w = 0.01).
(c) At this w⃗, compute the Jacobian using forward-mode auto-differentiation.
(d) At this w⃗, compute the Jacobian using backward-mode auto-differentiation.
(e) Don’t you love that software exists to do this for us?
4 Convolutions
6. (5 points) We’ll start to introduce the properties of convolutions here that serve as a foundation for many computer vision applications in deep learning. In class, we discussed convolutions. In this question, we will develop formal intuition around a slight modification of that idea – circular convolutions.
First, let’s define the circular convolution of two n-dimensional vectors x and w:

(x ∗ w)_i = Σ_{j=0}^{n−1} x_j w_{(i−j) mod n},    i = 0, ..., n − 1    (17)
We can write the above equation as a matrix-vector multiplication.
Given an n-dimensional vector a = (a0,...,an−1), we define the associated matrix Ca whose first column is made up of these numbers, and each subsequent column is obtained by a circular shift of the previous column.
C_a =
⎡ a_0       a_{n−1}   a_{n−2}   ···   a_1 ⎤
⎢ a_1       a_0       a_{n−1}   ···   a_2 ⎥
⎢ a_2       a_1       a_0       ···   a_3 ⎥
⎢  ⋮         ⋮         ⋮         ⋱     ⋮  ⎥
⎣ a_{n−1}   a_{n−2}   a_{n−3}   ···   a_0 ⎦
Such matrices are called circulants. Any convolution x ∗ w can be equivalently represented as multiplication by a circulant matrix: x ∗ w = C_w x.
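A quick numerical illustration of this equivalence (a sketch using NumPy; np.roll builds each circularly shifted column, and the input vectors are illustrative):

```python
import numpy as np

def circulant(a):
    """C_a: first column is a; column j is a circularly shifted by j."""
    n = len(a)
    return np.stack([np.roll(a, j) for j in range(n)], axis=1)

def circular_conv(x, w):
    """(x * w)_i = sum_j x_j * w_{(i - j) mod n}, straight from Eqn (17)."""
    n = len(x)
    return np.array([sum(x[j] * w[(i - j) % n] for j in range(n))
                     for i in range(n)])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 0.0, -1.0, 0.5])
print(np.allclose(circular_conv(x, w), circulant(w) @ x))  # True
```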
Note that a circulant matrix is a kind of Toeplitz matrix with the additional property that a_i = a_{i+n}. Next, let’s introduce a special type of circulant called a shift matrix: a circulant matrix whose defining vector a has exactly one entry set to 1 and all others 0, i.e., a = (0, 1, 0, ..., 0). Let S be the circular right-shift operator, defined by the following action on vectors:
0
1
Sx =  
 ... ...
1 1 x0
 x1
 ...


0 xn−1  xn−1
  x0 
 =  ... 
  
  
xn−2
Notice that after applying the shift matrix, all the elements of x have been circularly shifted by one position.
(a) Prove that any circulant matrix commutes with the shift matrix. Note that this directly implies that convolutions commute with shift operators.
This leads to a very important property called translation (or shift) equivariance. A function f is shift equivariant if f(Sx) = Sf(x). Convolution’s commutativity with shifts implies that it does not matter whether we first shift a vector and then convolve it, or first convolve and then shift – the result will be the same. Notice that you just proved that circular convolutions are shift equivariant.
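Before attempting the proof below, it can help to watch the property numerically (a self-contained sketch; the vectors are illustrative, np.roll plays the role of S, and circular_conv repeats the earlier definition):

```python
import numpy as np

def shift(x):
    """Circular right shift S: (x0, ..., xn-1) -> (xn-1, x0, ..., xn-2)."""
    return np.roll(x, 1)

def circular_conv(x, w):
    """(x * w)_i = sum_j x_j * w_{(i - j) mod n}."""
    n = len(x)
    return np.array([sum(x[j] * w[(i - j) % n] for j in range(n))
                     for i in range(n)])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 0.0, -1.0, 0.5])

# Shift-then-convolve equals convolve-then-shift: f(Sx) == S f(x).
print(np.allclose(circular_conv(shift(x), w), shift(circular_conv(x, w))))
```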
(b) Now prove that (circular) convolution is the only linear operation that is shift equivariant. (Hint: how do you prove a bidirectional implication?)
(c) (Open-ended question) What does this tell you about designing deep learning architectures for processing spatial or spatio-temporal data like images and videos?
5 Paper Review [Extra credit for 4803, regular credit for 7643]
The paper we will study in this homework is ‘Understanding deep learning requires rethinking generalization’, presented at the International Conference on Learning Representations (ICLR) in 2017.
The paper presents a set of interesting experiments and results exploring the phenomenon of generalization in deep neural networks, i.e., the gap in performance between the training and test sets, and the role of explicit and implicit regularization in achieving it.
The paper can be viewed here: https://arxiv.org/abs/1611.03530
The evaluation rubric for this section is as follows:
7. [2 points] Briefly summarize the key contributions, strengths and weaknesses of this paper.
8. [2 points] What is your personal takeaway from this paper? This could be expressed either in terms of relating the approaches adopted in this paper to your traditional understanding of learning parameterized models, or potential future directions of research in the area which the authors haven’t addressed, or anything else that struck you as being noteworthy.
Guidelines: Please restrict your reviews to no more than 350 words (total length for answers to both the above questions).
6 Implement and train a network on CIFAR-10
9. (Up to 20 points) In PS1, you learned how to implement a softmax classifier and vanilla neural networks. Now, we will learn how to implement ConvNets. You will begin by writing the forward and backward passes for convolution and pooling, and then go on to train a shallow ConvNet on the CIFAR-10 dataset in Python. Next, you will learn to use PyTorch, a popular open-source deep learning framework, and use it to replicate the experiments from before.
Follow the instructions provided here
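If PyTorch is new to you, a shallow ConvNet for CIFAR-10-sized inputs looks roughly like the sketch below (the architecture here is an illustrative toy, not the one the notebooks specify):

```python
import torch
import torch.nn as nn

class ShallowConvNet(nn.Module):
    """Toy conv -> relu -> pool -> linear classifier for 32x32 RGB images."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(3, 32, kernel_size=3, padding=1)  # 32x32 -> 32x32
        self.pool = nn.MaxPool2d(2)                             # 32x32 -> 16x16
        self.fc = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        return self.fc(x.flatten(start_dim=1))

model = ShallowConvNet()
scores = model(torch.randn(8, 3, 32, 32))  # a batch of 8 CIFAR-sized images
print(scores.shape)  # torch.Size([8, 10])
```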
