$30
Instructions: Your answers to the questions below, including plots and mathematical work, should be submitted as a single PDF file. It’s preferred that you write your answers using software that typesets mathematics (e.g. LATEX, LYX, or MathJax via iPython), though if you need to you may scan handwritten work. You may find the minted package convenient for including source code in your LATEX document. If you are using LYX, then the listings package tends to work better.
1 Introduction
In this homework you will implement ridge regression using gradient descent and stochastic gradient descent. We’ve provided a lot of support Python code to get you started on the right track. References below to particular functions that you should modify are referring to the support code, which you can download from the website. If you have time after completing the assignment, you might pursue some of the following:
• Study up on numpy’s broadcasting to see if you can simplify and/or speed up your code.
• Think about how you could make the code more modular so that you could easily try different loss functions and step size methods.
• Experiment with more sophisticated approaches to setting the step sizes for SGD (e.g. try out the recommendations in “Bottou’s SGD Tricks” on the website)
• Instead of taking 1 data point at a time, as in SGD, try minibatch gradient descent, where you use multiple points at a time to get your step direction. How does this effect convergence speed? Are you getting computational speedup as well by using vectorized code?
• Advanced: What kind of loss function will give us “quantile regression”?
2 Linear Regression
2.1 Feature Normalization
When feature values differ greatly, we can get much slower rates of convergence of gradient-based algorithms. Furthermore, when we start using regularization (introduced in a later problem), features with larger values are treated as “more important”, which is not usually what you want. One common approach to feature normalization is perform an affine transformation (i.e. shift and rescale) on each feature so that all feature values in the training set are in [0,1]. Each feature gets its own transformation. We then apply the same transformations to each feature on the test[1] set. It’s important that the transformation is “learned” on the training set, and then applied to the test set. It is possible that some transformed test set values will lie outside the [0,1] interval.
Modify function feature_normalization to normalize all the features to [0,1]. (Can you use numpy’s “broadcasting” here?) Note that a feature with constant value cannot be normalized in this way. Your function should discard features that are constant in the training set.
2.2 Gradient Descent Setup
In linear regression, we consider the hypothesis space of linear functions hθ : Rd → R, where
hθ(x) = θTx,
for θ,x ∈ Rd, and we choose θ that minimizes the following “square loss” objective function:
,
where (x1,y1),...,(xm,ym) ∈ Rd × R is our training data.
While this formulation of linear regression is very convenient, it’s more standard to use a hypothesis space of “affine” functions: hθ,b(x) = θTx + b,
which allows a “bias” or nonzero intercept term. The standard way to achieve this, while still maintaining the convenience of the first representation, is to add an extra dimension to x that is always a fixed value, such as 1. You should convince yourself that this is equivalent. We’ll assume this representation, and thus we’ll actually take θ,x ∈ Rd+1.
1. Let X ∈ Rm×(d+1) be the design matrix, where the i’th row of X is xi. Let y =
(y1,...,ym)T ∈ Rm×1 be the “response”. Write the objective function J(θ) as a matrix/vector expression, without using an explicit summation sign. [Being able to write expressions as matrix/vector expressions without summations is crucial to making implementations that are useful in practice, since you can use numpy (or more generally, an efficient numerical linear algebra library) to implement these matrix/vector operations orders of magnitude faster than naively implementing with loops in Python.]
2. Write down an expression for the gradient of J (again, as a matrix/vector expression, without using an explicit summation sign).
3. In our search for a θ that minimizes J, suppose we take a step from θ to θ + ηh, where h ∈ Rd+1 is the “step direction” (recall, this is not necessarily a unit vector) and η ∈ (0,∞) is the “step size” (note that this is not the actual length of the step, which is ηkhk). Use the gradient to write down an approximate expression for the change in objective function value J(θ + ηh) − J(θ). [This approximation is called a “linear” or “first-order” approximation.]
4. Write down the expression for updating θ in the gradient descent algorithm. Let η be the step size.
5. Modify the function compute_square_loss, to compute J(θ) for a given θ. You might want to create a small dataset for which you can compute J(θ) by hand, and verify that your compute_square_loss function returns the correct value.
6. Modify the function compute_square_loss_gradient, to compute ∇θJ(θ). You may again want to use a small dataset to verify that your compute_square_loss_gradient function returns the correct value.
2.3 (OPTIONAL) Gradient Checker
For many optimization problems, coding up the gradient correctly can be tricky. Luckily, there is a nice way to numerically check the gradient calculation. If J : Rd → R is differentiable, then for any vector h ∈ Rd, the directional derivative of J at θ in the direction h is given by[2]
.
We can approximate this directional derivative by choosing a small value of ε > 0 and evaluating the quotient above. We can get an approximation to the gradient by approximating the directional derivatives in each coordinate direction and putting them together into a vector. In other words, take h = (1,0,0,...,0) to get the first component of the gradient. Then take h = (0,1,0,...,0) to get the second component. And so on. See http://ufldl.stanford.edu/wiki/index. php/Gradient_checking_and_advanced_optimization for details.
1. Complete the function grad_checker according to the documentation given. Alternatively, you may complete the function generic_grad_checker so that it works for any objective function. It should take as parameters a function that computes the objective function and a function that computes the gradient of the objective function. Note: Running the gradient checker takes extra time. In practice, once you’re convinced your gradient calculator is correct, you should stop calling the checker so things run faster.
2.4 Batch Gradient Descent3
At the end of the skeleton code, the data is loaded, split into a training and test set, and normalized. We’ll now finish the job of running regression on the training set. Later on we’ll plot the results together with SGD results.
1. Complete batch_gradient_descent.
2. Now let’s experiment with the step size. Note that if the step size is too large, gradient descent may not converge[3]. Starting with a step-size of 0.1, try various different fixed step sizes to see which converges most quickly and/or which diverge. As a minimum, try step sizes 0.5, 0.1, .05, and .01. Plot the value of the objective function as a function of the number of steps for each step size. Briefly summarize your findings.
3. (Optional) Implement backtracking line search (google it). How does it compare to the best fixed step-size you found in terms of number of steps? In terms of time? How does the extra time to run backtracking line search at each step compare to the time it takes to compute the gradient? (You can also compare the operation counts.)
2.5 Ridge Regression (i.e. Linear Regression with `2 regularization)
When we have a large number of features compared to instances, regularization can help control overfitting. Ridge regression is linear regression with `2 regularization. The regularization term is sometimes called a penalty term. The objective function for ridge regression is
where λ is the regularization parameter, which controls the degree of regularization. Note that the bias parameter is being regularized as well. We will address that below.
1. Compute the gradient of J(θ) and write down the expression for updating θ in the gradient descent algorithm. (Matrix/vector expression – no summations please.)
2. Implement compute_regularized_square_loss_gradient.
3. Implement regularized_grad_descent.
4. For regression problems, we may prefer to leave the bias term unregularized. One approach is to change J(θ) so that the bias is separated out from the other parameters and left unregularized. Another approach that can achieve approximately the same thing is to use a very large number B, rather than 1, for the extra bias dimension. Explain why making B large decreases the effective regularization on the bias term, and how we can make that regularization as weak as we like (though not zero).
5. (Optional) Develop a formal statement of the claim in the previous problem, and prove the statement.
6. (Optional) Try various values of B to see what performs best in test.
7. Now fix B = 1. Choosing a reasonable step-size (or using backtracking line search), find the that minimizes J(θ) over a range of λ. You should plot the training loss and the test loss (just the square loss part, without the regularization, in each case) as a function of λ. Your goal is to find λ that gives the minimum test loss. It’s hard to predict what λ that will be, so you should start your search very broadly, looking over several orders of magnitude. For example, . Once you find a range that works better, keep zooming in. You may want to have log(λ) on the x-axis rather than λ. [If you like, you may use sklearn to help with the hyperparameter search.]
8. What θ would you select for deployment and why?
2.6 Stochastic Gradient Descent
When the training data set is very large, evaluating the gradient of the objective function can take a long time, since it requires looking at each training example to take a single gradient step. When the objective function takes the form of an average of many values, such as
(as it does in the empirical risk), stochastic gradient descent (SGD) can be very effective. In SGD, rather than taking −∇J(θ) as our step direction, we take −∇fi(θ) for some i chosen uniformly at random from {1,...,m}. The approximation is poor, but we will show it is unbiased.
In machine learning applications, each fi(θ) would be the loss on the ith example (and of course we’d typically write n instead of m, for the number of training points). In practical implementations for ML, the data points are randomly shuffled, and then we sweep through the whole training set one by one, and perform an update for each training example individually. One pass through the data is called an epoch. Note that each epoch of SGD touches as much data as a single step of batch gradient descent. You can use the same ordering for each epoch, though optionally you could investigate whether reshuffling after each epoch affects the convergence speed.
1. Show that the objective function
can be written in the form by giving an expression for fi(θ) that makes the two expressions equivalent.
2. Show that the stochastic gradient ∇fi(θ), for i chosen uniformly at random from {1,...,m}, is an unbiased estimator of ∇J(θ). In other words, show that E[∇fi(θ)] = ∇J(θ) for any θ. (Hint: It will be easier, notationally, to prove this for a general , rather than the specific case of ridge regression. You can start by writing down an expression for E[∇fi(θ)]...)
3. W rite down the update rule for θ in SGD for the ridge regression objective function.
4. Implement stochastic_grad_descent. (Note: You could potentially generalize the code you wrote for batch gradient to handle minibatches of any size, including 1, but this is not necessary.)
5. Use SGD to find θλ∗ that minimizes the ridge regression objective for the λ and B that you selected in the previous problem. (If you could not solve the previous problem, choose λ = 10−2 and B = 1). Try a few fixed step sizes (at least try ηt ∈ {0.05,.005}. Note that SGD may not converge with fixed step size. Simply note your results. Next try step sizes that decrease with the step number according to the following schedules: and
. For each step size rule, plot the value of the objective function (or the log of the objective function if that is more clear) as a function of epoch (or step number, if you prefer) for each of the approaches to step size. How do the results compare? Some things to note: 1) In this case we are investigating the convergence rate of the optimization algorithm with different step size schedules, thus we’re interested in the value of the objective function, which√
includes the regularization term. 2) Sometimes the initial step size (1 for 1/t and 1/ t) is too aggressive and will get you into a part of parameter space from which you can’t recover. Although in this problem, this probably means you implemented something poorly, in general you can try starting at t = 10, for example, to skip the large initial steps. 3) As we’ll learn in an upcoming lecture, SGD convergence is much slower than GD once we get close to the minimizer. (Remember, the SGD step directions are very noisy versions of the GD step direction). If you look at the objective function values on a logarithmic scale, it may look like SGD will never find objective values that are as low as GD gets. In terminology we’ll learn in Lecture 2, GD has much smaller “optimization error” than SGD. However, this difference in optimization error is usually dominated by other sources of error (estimation error and approximation error). Moreover, for very large datasets, SGD (or minibatch GD) is much faster (by wall-clock time) than GD to reach a point that’s close [enough] to the minimizer.
6. (Optional) Try a stepsize rule of the form , where λ is your regularization constant, and η0 a constant you can choose. How do the results compare?
3 Risk Minimization
3.1 Square Loss
1. Let y be a random variable with a known distribution, and consider the square loss function `(a,y) = (a − y)2. We want to find the action a∗ that has minimal risk. That is, we want to find a∗ = argmina E(a − y)2, where the expectation is with respect to y. Show that a∗ = Ey, and the Bayes risk (i.e. the risk of a∗) is Var(y). In other words, if you want to try to predict the value of a random variable, the best you can do (for minimizing expected square loss) is to predict the mean of the distribution. Your expected loss for predicting the mean will be the variance of the distribution. [Hint: Recall that Var(y) = Ey2 − (Ey)2.]
2. Now let’s introduce an input. Recall that the expected loss or “risk” of a decision function f : X → A is
R(f) = E`(f(x),y),
where (x,y) ∼ PX×Y, and the Bayes decision function f∗ : X → A is a function that achieves the minimal risk among all possible functions:
R(f∗) = inf R(f).
f
Here we consider the regression setting, in which A = Y = R. We will show for the square loss `(a,y) = (a − y)2, the Bayes decision function is f∗(x) = E[y | x], where the expectation is over y. As before, we assume we know the data-generating distribution PX×Y.
(a) We’ll approach this problem by finding the optimal action for any given x. If somebody tells us x, we know that the corresponding y is coming from the conditional distribution y | x. For a particular x, what value should we predict (i.e. what action a should we produce) that has minimal expected loss? Express your answer as a decision function f(x), which gives the best action for any given x. In mathematical notation, we’re looking for f∗(x) = argmina Eh(a − y)2 | xi, where the expectation is with respect to y. (Hint: There is really nothing to do here except write down the answer, based on the previous question. But make sure you understand what’s happening...)
(b) In the previous problem we produced a decision function f∗(x) that minimized the risk for each x. In other words, for any other decision function f(x), f∗(x) is going to be at least as good as f(x), for every single x. In math, we mean
Eh(f∗(x) − y)2 | xi ≤ Eh(f(x) − y)2 | xi,
for all x. To show that f∗(x) is the Bayes decision function, we need to show that
Eh(f∗(x) − y)2i ≤ Eh(f(x) − y)2i
for any f. Explain why this is true. (Hint: Law of iterated expectations.)
3.2 [Optional] Median Loss
1. (Optional) Show that for the absolute loss `(y,yˆ ) = |y − yˆ|, f∗(x) is a Bayes decision function if f∗(x) is the median of the conditional distribution of y given x. [Hint: As in the previous section, consider one x at time. It may help to use the following characterization of a median: m is a median of the distribution for random variable and
Note: This loss function leads to “median regression”. There are other loss functions that lead to “quantile regression” for any chosen quantile. (For partial credit, you may assume that the distribution of y | x is discrete or continuous. For full credit, no assumptions about the distribution.)
[1] Throughout this assignment we refer to the “test” set. It may be more appropriate to call this set the “validation” set, as it will be a set of data on which we compare the performance of multiple models. Typically a test set is only used once, to assess the performance of the model that performed best on the validation set.
[2] Of course, it is also given by the more standard definition of directional derivative, .
The form given gives a better approximation to the derivative when we are using small (but not infinitesimally small)
Sometimes people say “batch gradient descent” or “full batch gradient descent” to mean gradient descent, defined as we discussed in class. They do this to distinguish it from stochastic gradient descent and minibatch gradient descent, which they probably use as their default.
[3] For the mathematically inclined, there is a theorem that if the objective function is convex and differentiable, and the gradient of the objective is Lipschitz continuous with constant L > 0, then gradient descent converges for fixed steps of size 1/L or smaller. See https://www.cs.cmu.edu/~ggordon/10725-F12/scribes/10725_ Lecture5.pdf, Theorem 5.1.