Visualizing Gradients
1. On the left is a 3D plot of $f(x, y) = (x - 2)^2 + (y - 3)^2$. On the right is a plot of its gradient field. Note that the arrows show the relative magnitudes of the gradient vectors.
(a) From the visualization, what do you think is the minimal value of this function and where does it occur?
(b) Calculate the gradient $\nabla f$.
(c) When $\nabla f = \vec{0}$, what are the values of $x$ and $y$?
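For reference, a gradient field like the one described above can be generated in a few lines. The sketch below assumes the reconstructed function $f(x, y) = (x - 2)^2 + (y - 3)^2$ and hand-codes its partial derivatives; the grid ranges are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed function from Question 1: f(x, y) = (x - 2)^2 + (y - 3)^2,
# whose gradient is [2(x - 2), 2(y - 3)].
x = np.linspace(-1, 5, 15)
y = np.linspace(0, 6, 15)
X, Y = np.meshgrid(x, y)

dfdx = 2 * (X - 2)  # partial derivative with respect to x
dfdy = 2 * (Y - 3)  # partial derivative with respect to y

# Arrow lengths reflect the relative magnitudes of the gradient vectors,
# so the arrows shrink toward the minimum, where the gradient vanishes.
plt.quiver(X, Y, dfdx, dfdy)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Gradient field of f(x, y)")
plt.show()
```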
Gradient Descent Algorithm
2. Given the following loss function and $\vec{x} = [x_i]_{i=1}^{n}$, $\vec{y} = [y_i]_{i=1}^{n}$, and $\theta_t$, explicitly write out the update equation for $\theta_{t+1}$ in terms of $x_i$, $y_i$, $\theta_t$, and $\alpha$, where $\alpha$ is the constant learning rate.
$$L(\theta, \vec{x}, \vec{y}) = \frac{1}{n} \sum_{i=1}^{n} \left( x_i \theta - \log(y_i) \right)^2$$
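As a sanity check on the update equation you derive, here is a minimal sketch of the gradient descent step, assuming the loss as reconstructed above. The data values, initialization, and learning rate below are hypothetical.

```python
import numpy as np

def grad_L(theta, x, y):
    # Gradient of L(theta, x, y) = (1/n) * sum_i (x_i * theta - log(y_i))^2
    # with respect to theta (chain rule on the squared term).
    n = len(x)
    return (2 / n) * np.sum(x * (x * theta - np.log(y)))

def step(theta_t, x, y, alpha):
    # Gradient descent update: theta_{t+1} = theta_t - alpha * dL/dtheta
    return theta_t - alpha * grad_L(theta_t, x, y)

# Hypothetical data, just to exercise the update.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.5, 3.5])
theta, alpha = 0.0, 0.1
for _ in range(5):
    theta = step(theta, x, y, alpha)
print(theta)
```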
Convexity
3. Convexity allows optimization problems to be solved more efficiently and guarantees that any local minimum we find is also a global minimum. Mainly, it gives us a nice way to minimize loss (i.e. gradient descent). There are three ways to informally define convexity.
a. Walking in a straight line between points on the function keeps you at or above the function. This works for any function.
b. The tangent line at any point lies at or below the function, globally. To use this definition, the function must be differentiable.
c. The second derivative is non-negative everywhere (in other words, the function is "concave up" everywhere). To use this definition, the function must be twice differentiable.
Is the function described in Question 1 convex? Make an argument visually.
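Besides the visual argument, definition (a) lends itself to a direct numerical check: sample pairs of points, walk the straight line between them, and verify the chord never dips below the function. The sketch below applies this to the function from Question 1 (as reconstructed); the trial count and tolerance are arbitrary, and passing the check is evidence of convexity, not a proof.

```python
import numpy as np

def chord_check(f, dim, trials=1000, tol=1e-9):
    # Definition (a): for all points u, v and all t in [0, 1],
    # f(t*u + (1 - t)*v) <= t*f(u) + (1 - t)*f(v) must hold (up to tol).
    rng = np.random.default_rng(0)
    for _ in range(trials):
        u = rng.uniform(-10, 10, size=dim)
        v = rng.uniform(-10, 10, size=dim)
        t = rng.uniform()
        if f(t * u + (1 - t) * v) > t * f(u) + (1 - t) * f(v) + tol:
            return False  # found a chord that dips below the function
    return True  # no violations found

# The paraboloid from Question 1 passes the check.
f = lambda p: (p[0] - 2) ** 2 + (p[1] - 3) ** 2
print(chord_check(f, dim=2))  # True
```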
GPA Descent
4. Consider the following non-linear model with two parameters:
$$f_\theta(x) = \theta_0 \cdot 0.5 + \theta_0 \cdot \theta_1 \cdot x_1 + \sin(\theta_1) \cdot x_2$$
For some nonsensical reason, we decide to use the residuals of our model as the loss function.
That is, the loss for a single observation is

$$L(\theta, \vec{x}, y) = y - f_\theta(\vec{x})$$
We want to use gradient descent to determine the optimal model parameters, $\theta_0$ and $\theta_1$.
(a) Suppose we have just one observation in our training data, $(x_1 = 1,\; x_2 = \ldots)$.
Assume that we set the learning rate $\alpha$ to 1. An incomplete version of the gradient descent update equations for $\theta_0$ and $\theta_1$ is shown below. $\theta_0^{(t)}$ and $\theta_1^{(t)}$ denote the guesses for $\theta_0$ and $\theta_1$ at timestep $t$, respectively.

$$\theta_0^{(t+1)} = \theta_0^{(t)} - A$$
$$\theta_1^{(t+1)} = \theta_1^{(t)} - B$$

Express both $A$ and $B$ in terms of $\theta_0^{(t)}$, $\theta_1^{(t)}$, and any necessary constants.
(b) Assume we initialize both $\theta_0^{(0)}$ and $\theta_1^{(0)}$ to 0. Determine $\theta_0^{(1)}$ and $\theta_1^{(1)}$ (i.e. the guesses for $\theta_0$ and $\theta_1$ after one iteration of gradient descent).
(c) What happens to $\theta_0$ as $t \to \infty$ (i.e. as we run more and more iterations of gradient descent)?
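To see part (c) concretely, the sketch below iterates the update equations under the model and residual loss as reconstructed above, with the gradients derived by hand from $L = y - f_\theta(\vec{x})$. The value of $x_2$ is a hypothetical placeholder, since it is cut off in the original; note that $y$ drops out of the gradients entirely because $L$ is linear in $y$.

```python
import math

# Assumed model:  f(x) = 0.5*theta0 + theta0*theta1*x1 + sin(theta1)*x2
# Assumed loss:   L = y - f(x)  (the raw residual)
# Hand-derived partial derivatives of L:
#   dL/dtheta0 = -(0.5 + theta1 * x1)
#   dL/dtheta1 = -(theta0 * x1 + cos(theta1) * x2)
x1, x2 = 1.0, 1.0          # x1 = 1 is given; x2 is a hypothetical placeholder
alpha = 1.0                # learning rate from part (a)
theta0, theta1 = 0.0, 0.0  # initialization from part (b)

for t in range(10):
    g0 = -(0.5 + theta1 * x1)
    g1 = -(theta0 * x1 + math.cos(theta1) * x2)
    theta0, theta1 = theta0 - alpha * g0, theta1 - alpha * g1
    print(t + 1, theta0, theta1)

# Under these values the raw residual is unbounded below, so gradient
# descent never settles: theta0 keeps growing with every iteration.
```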