COMP9417
Homework 2: Newton’s Method and Mean Squared Error of
Estimators
Introduction

In homework 1, we considered gradient descent (and coordinate descent) for minimizing a regularized loss function. In this homework, we consider an alternative method known as Newton’s algorithm. We will first run Newton’s algorithm on a simple toy problem, and then implement it from scratch on a real-life classification problem. We will also dig deeper into the idea of estimation, comparing estimators based on bias, variance and MSE.

Points Allocation

There are a total of 28 marks.
What to Submit
1
When and Where to Submit
Question 1. Introduction to Newton’s Method
Note: throughout this question do not use any existing implementations of any of the algorithms discussed unless explicitly asked to in the question. Using existing implementations can result in a grade of zero for the entire question. In homework 1 we studied gradient descent (GD), which is usually referred to as a first order method. Here, we study an alternative algorithm known as Newton’s algorithm, which is generally referred to as a second order method. Roughly speaking, a second order method makes use of both first and second derivatives. Generally, second order methods are much more accurate than first order ones. Given a twice differentiable function g : R→R, Newton’s method generates a sequence {x(k)} iteratively according to the following update rule:
x(k+1) = x(k) − g′(x(k))/g″(x(k)), k = 0,1,2,..., (1)
For example, consider the function g(x) = x²/2 − sin(x) with initial guess x(0) = 0. Then

g′(x) = x − cos(x), and g″(x) = 1 + sin(x),
and so we have the following iterations:

x(1) = x(0) − g′(x(0))/g″(x(0)) = 0 − (0 − 1)/(1 + 0) = 1
x(2) = x(1) − g′(x(1))/g″(x(1)) ≈ 0.7504
x(3) ≈ 0.7391, x(4) ≈ 0.7391, ...
and this continues until we terminate the algorithm (as a quick exercise for your own benefit, code this up, plot the function and each of the iterates). We note here that in practice, we often use a different update called the dampened Newton method, defined by:
x(k+1) = x(k) − α g′(x(k))/g″(x(k)), k = 0,1,2,..., (2)
Here, as in the case of GD, the step size α has the effect of ‘dampening’ the update.
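As a rough sketch of the suggested exercise (illustrative code only; the function name `newton` and the choice of 20 iterations are our own), the iteration above, with or without dampening, can be implemented as:

```python
import math

def newton(g1, g2, x0, alpha=1.0, iters=20):
    """Run (dampened) Newton's method given the first and second derivatives.

    alpha = 1.0 recovers the undampened update (1); alpha < 1 gives update (2).
    """
    xs = [x0]
    for _ in range(iters):
        x = xs[-1]
        xs.append(x - alpha * g1(x) / g2(x))
    return xs

# Toy problem from the text: g'(x) = x - cos(x), g''(x) = 1 + sin(x)
xs = newton(lambda x: x - math.cos(x), lambda x: 1 + math.sin(x), x0=0.0)
print(xs[:4])   # first few iterates
print(xs[-1])   # at convergence the iterate satisfies x = cos(x)
```

Plotting each iterate on top of the function (e.g. with matplotlib) is a useful way to visualize the quadratic convergence.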
For a twice differentiable function f : Rⁿ → R, the Newton update becomes

x(k+1) = x(k) − (H(x(k)))⁻¹ ∇f(x(k)), k = 0,1,2,..., (3)
where H(x) = ∇2f(x) is the Hessian of f. Explain heuristically (in a couple of sentences) how the above formula is a generalization of equation (1) to functions with vector inputs. what to submit: Some commentary
f(x, y) = 100(y − x²)² + (1 − x)².
Create a 3D plot of the function using mplot3d (see lab0 for example). Further, compute the gradient and Hessian of f. what to submit: A single plot, the code used to generate the plot, the gradient and Hessian calculated along with all working. Add a copy of the code to solutions.py
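A minimal sketch of such a plot is below (the Agg backend, the output file name, and the inclusion of the standard gradient/Hessian of this well-known function as a sanity check for your own working are our choices, not requirements of the question):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")           # non-interactive backend (our choice)
import matplotlib.pyplot as plt

def f(x, y):
    return 100 * (y - x**2)**2 + (1 - x)**2

def grad(x, y):
    # Standard partial derivatives of f
    return np.array([-400 * x * (y - x**2) - 2 * (1 - x),
                     200 * (y - x**2)])

def hessian(x, y):
    # Standard second derivatives of f
    return np.array([[-400 * (y - 3 * x**2) + 2, -400 * x],
                     [-400 * x, 200.0]])

# Surface plot over a grid around the minimizer (1, 1)
X, Y = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-1, 3, 100))
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(X, Y, f(X, Y), cmap="viridis")
fig.savefig("rosenbrock.png")
```

Note that the gradient vanishes at (x, y) = (1, 1), the global minimizer, which is a quick check on your derivation.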
Question 2. Solving Logistic Regression Numerically
Note: throughout this question do not use any existing implementations of any of the algorithms discussed unless explicitly asked to do so in the question. Using existing implementations can result in a grade of zero for the entire question. In this question we will compare gradient descent and Newton’s algorithm for solving the logistic regression problem. Recall that in logistic regression, our goal is to minimize the log-loss, also referred to as the cross entropy loss. For an intercept β0 ∈ R, parameter vector β = (β1,...,βp) ∈ Rp, target yi ∈ {0,1}, and feature vector xi = (xi1,xi2,...,xip)T ∈ Rp for i = 1,...,n, the (ℓ2-regularized) log-loss that we will work with is:
L(β0, β) = (1/n) Σᵢ₌₁ⁿ [ −yᵢ log σ(β0 + βᵀxᵢ) − (1 − yᵢ) log(1 − σ(β0 + βᵀxᵢ)) ] + λ‖β‖₂²,
where σ(z) = (1 + e−z)−1 is the logistic sigmoid, and λ is a hyper-parameter that controls the amount of regularization. Note that you are provided with an implementation of this loss in helper.py.
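For intuition, a direct NumPy version of this loss might look as follows (function names are illustrative, not the ones used in helper.py; we assume the ridge penalty applies to β only, not to the intercept, consistent with the discussion of intercept vs. non-intercept parameters below):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(beta0, beta, X, y, lam):
    """l2-regularized cross entropy loss; penalty on beta only, not beta0."""
    p = sigmoid(beta0 + X @ beta)
    ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ce + lam * np.sum(beta**2)

# Tiny example: with all parameters zero, every predicted probability is 0.5,
# so the loss is log(2) regardless of the data
X = np.array([[1.0, 2.0], [-1.0, 0.5]])
y = np.array([1.0, 0.0])
print(log_loss(0.0, np.zeros(2), X, y, lam=1.0))
```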
Write down the gradient descent updates at the coordinate level, i.e. complete the following (where † denotes the expression you need to derive):

β0(k) = β0(k−1) − α × †
β1(k) = β1(k−1) − α × †
β2(k) = β2(k−1) − α × †
...
βp(k) = βp(k−1) − α × †
Make your expression as simple as possible, and be sure to include all your working. Hint: If you haven’t done so already, take a look at Question 4 of Tutorial 3. what to submit: your coordinate level GD updates along with any working.
β(k) = β(k−1) − α × †
Your expression should only be in terms of β0, β, xi and yi. Next, let γ = [β0, βᵀ]ᵀ be the (p + 1)-dimensional vector that combines the intercept with the coefficient vector β, and write down the update
γ(k) = γ(k−1) − α × †.
Note: This final expression will be our vectorized implementation of gradient descent. The point of the above exercises is just to be careful about the differences between intercept and non-intercept parameters. Doing GD on the coordinates is extremely inefficient in practice. what to submit: your vectorized GD updates along with any working.
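A sketch of one vectorized GD step is below (assuming, as above, that the ℓ2 penalty applies only to the non-intercept coefficients, and writing `Xt` for the feature matrix with a prepended column of ones so that γ = [β0, βᵀ]ᵀ; all names are our own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(gamma, Xt, y, lam):
    """Regularized log-loss; gamma = [beta0, beta], Xt has a leading ones column."""
    p = sigmoid(Xt @ gamma)
    ce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ce + lam * np.sum(gamma[1:]**2)   # intercept gamma[0] not penalized

def grad_gamma(gamma, Xt, y, lam):
    """Gradient of the loss above with respect to gamma."""
    g = Xt.T @ (sigmoid(Xt @ gamma) - y) / len(y)
    reg = 2 * lam * gamma
    reg[0] = 0.0                             # no penalty gradient for the intercept
    return g + reg

def gd_step(gamma, Xt, y, lam, alpha):
    return gamma - alpha * grad_gamma(gamma, Xt, y, lam)

# Example: a single step from the zero vector on tiny illustrative data
Xt = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, 0.0])
print(gd_step(np.zeros(2), Xt, y, lam=0.1, alpha=0.5))
```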
For notational ease, you might find it helpful to introduce shorthand notation for repeated terms. Hint: When doing this question, first solve it without the regularization term (i.e. ignore the ridge penalty term and take λ = 0). Once you have a form for the Hessian in that case, it should be more straightforward to extend it to the more general setting needed here. what to submit: your vectorized dampened Newton updates along with any working.
What to submit: the print out of the rows requested in (VI). A copy of your code in solutions.py
A quick aside: Backtracking Line Search: In homework 1, we chose the step-size parameter in our implementations naively by looking at a grid of step-sizes and choosing the best one. In practice, this is obviously not feasible (due to the computational cost of training the model multiple times for a large number of candidate step-sizes). One very popular and empirically successful approach is known as Backtracking Line Search (BLS). In BLS, the step-size is chosen adaptively at each iteration of the algorithm by checking a given criterion. In particular, choose parameters 0 < a ≤ 0.5 and 0 < b < 1. At iteration k of the algorithm (where we are minimizing the function f(x)), the current iterate is x(k); set the step-size to α = 1. Then, while
f(x(k) − α∇f(x(k))) > f(x(k)) − aα‖∇f(x(k))‖₂²,
shrink the step-size according to α = bα, otherwise use the current step-size α in your update. We will not explain why this is a sensible thing to do here, but you will be able to find information about it online if you are interested. (Of course, we can also discuss it during one of the consultation sessions). Importantly however, the BLS idea can be applied in both gradient descent and the dampened Newton algorithm, and you will be required to use the BLS in your implementations of both algorithms in the following questions.
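The BLS loop can be sketched as follows (written for a generic descent direction so that the same helper applies to both GD and dampened Newton; this generalization of the condition, and all names, are our own):

```python
import numpy as np

def backtracking_step(f, grad_f, x, direction, a=0.5, b=0.8):
    """Shrink alpha from 1 until the sufficient-decrease condition holds.

    For GD use direction = grad_f(x); for dampened Newton use the Newton
    direction H(x)^{-1} grad_f(x).
    """
    alpha = 1.0
    fx, gx = f(x), grad_f(x)
    while f(x - alpha * direction) > fx - a * alpha * (gx @ direction):
        alpha *= b
    return alpha

# Tiny example: f(x) = ||x||^2 / 2, gradient descent direction
f = lambda x: 0.5 * np.sum(x**2)
grad = lambda x: x
x = np.array([3.0, -4.0])
alpha = backtracking_step(f, grad, x, grad(x))
print(alpha)
```

On this quadratic the full step α = 1 already satisfies the condition; on steeper functions the loop shrinks α by the factor b until it does.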
0.5, and initialize β(0) = 0p, where 0p is the p-dimensional vector of zeroes. For your BLS implementation take a = 0.5, b = 0.8. Report your train and test losses, as well as plots of your step-sizes at each iteration, and train loss at each iteration. Hint: if you need a sanity check here, the best thing to do is use sklearn to fit logistic regression models. This should give you an idea of what kind of loss your implementation should be achieving (if your implementation does as well or better, then you are on the right track). what to submit: two plots, one for the step-sizes and the other for the train losses. Report your train and test losses, and a screenshot of any code used in this section, as well as a copy of your code in solutions.py.
Question 3. More on MLE
Let X1, ..., Xn ~ N(µ, σ²) be i.i.d. Recall that in Tutorial 2 we showed that the MLE estimators of µ and σ² were µ̂MLE and σ̂²MLE, where

µ̂MLE = X̄, and σ̂²MLE = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)².
In this question, we will explore these estimators in more depth.
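As a quick numerical sanity check of the closed-form MLEs (illustrative code, not part of the required submission), one can verify that perturbing either parameter away from the MLE lowers the Gaussian log-likelihood on a sample:

```python
import numpy as np

def gauss_loglik(x, mu, sigma2):
    """Log-likelihood of an i.i.d. N(mu, sigma2) sample x."""
    return (-0.5 * len(x) * np.log(2 * np.pi * sigma2)
            - np.sum((x - mu)**2) / (2 * sigma2))

rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=1.5, size=200)

mu_mle = x.mean()                  # sample mean
s2_mle = np.mean((x - mu_mle)**2)  # (1/n) * sum of squared deviations

# The MLE should score at least as high as any nearby candidate
best = gauss_loglik(x, mu_mle, s2_mle)
for dmu in (-0.1, 0.1):
    for ds in (-0.1, 0.1):
        assert gauss_loglik(x, mu_mle + dmu, s2_mle + ds) < best
print(mu_mle, s2_mle)
```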
what to submit: the bias and variance of the estimators, along with your working.
.
Discuss whether this estimator is better or worse than the MLE estimator

σ̂²MLE = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)².
Be sure to include a detailed analysis of the bias and variance of both estimators, and describe what happens to each of these quantities (for each of the estimators) as the sample size n increases (use plots). For your plots, you can assume that σ = 1.
what to submit: the bias and variance of the new estimator. A plot comparing the bias of both estimators as a function of the sample size n, a plot comparing the variance of both estimators as a function of the sample size n, use labels/legends in your plots. A copy of the code used here in solutions.py
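For illustration, the sketch below compares the MLE (scaling S = Σ(Xᵢ − X̄)² by 1/n) with the unbiased sample variance (scaling S by 1/(n − 1)), using the standard facts that for Gaussian data E[S] = (n − 1)σ² and Var[S] = 2(n − 1)σ⁴; the estimator you are asked about may differ, so adapt the constant c accordingly:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

sigma2 = 1.0                 # the question allows assuming sigma = 1
n = np.arange(2, 101)

# Estimators of the form c * S with S = sum((X_i - Xbar)^2), S ~ sigma^2 * chi^2_{n-1}
def bias(c, n):
    return c * (n - 1) * sigma2 - sigma2

def variance(c, n):
    return 2 * c**2 * (n - 1) * sigma2**2

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for label, c in [("MLE (1/n)", 1 / n), ("unbiased (1/(n-1))", 1 / (n - 1))]:
    ax1.plot(n, bias(c, n), label=label)
    ax2.plot(n, variance(c, n), label=label)
ax1.set(xlabel="n", ylabel="bias"); ax1.legend()
ax2.set(xlabel="n", ylabel="variance"); ax2.legend()
fig.savefig("estimator_comparison.png")
```

Both bias and variance of both estimators vanish as n grows; the interesting comparison is their relative size at small n, which is what the requested plots should show.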