1 [42 points] Linear regression on a polynomial
The files q1xTrain.npy, q1xTest.npy, q1yTrain.npy, and q1yTest.npy specify a linear regression problem for a polynomial. q1xTrain.npy represents the inputs (x(i) ∈ R) and q1yTrain.npy represents the outputs (y(i) ∈ R) of the training set, with one training example per row.
(a) [15 points] You will compare the following two optimization methods for finding the coefficients of a degree-one polynomial (i.e., slope and intercept) that minimize the training loss.
• Batch gradient descent
• Stochastic gradient descent
i. [12 points] Give the coefficients generated by each of the optimization methods. Report the hyperparameters used to generate the coefficients.
ii. [3 points] Compare the two optimization methods in terms of the number of epochs required for convergence, where an “epoch” is defined as one pass through the entire training set. Which method converges faster? Report the hyperparameters used for the comparison. [Hint: in this question, the training process can be considered to have converged once the mean squared error on the training dataset is consistently small enough (e.g., ≤ 0.2).]
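For reference, here is a minimal NumPy sketch of the two update rules for the degree-one fit, assuming a design matrix with an explicit intercept column; the learning rates, epoch counts, and convergence check below are illustrative placeholders, not the hyperparameters you are asked to report.

import numpy as np

x = np.load("q1xTrain.npy")
y = np.load("q1yTrain.npy")
X = np.stack([np.ones_like(x), x], axis=1)   # degree-1 features: [1, x]

def mse(w):
    return np.mean((X @ w - y) ** 2)

# Batch gradient descent: one parameter update per epoch, using the full gradient.
def batch_gd(lr=0.1, max_epochs=1000):
    w = np.zeros(2)
    for epoch in range(max_epochs):
        grad = (2.0 / len(y)) * X.T @ (X @ w - y)
        w -= lr * grad
        if mse(w) <= 0.2:                    # example convergence criterion from the hint
            return w, epoch + 1
    return w, max_epochs

# Stochastic gradient descent: one parameter update per training example.
def sgd(lr=0.05, max_epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(2)
    for epoch in range(max_epochs):
        for i in rng.permutation(len(y)):
            grad = 2.0 * (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
        if mse(w) <= 0.2:
            return w, epoch + 1
    return w, max_epochs

Both functions return the fitted (intercept, slope) pair together with the number of epochs used, which supports the comparison asked for in item ii.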
(b) [15 points] Next, you will investigate the problem of over-fitting. Recall the figure from lecture that explored over-fitting as a function of the degree of the polynomial M, where the Root-Mean-Square (RMS) Error is defined as
E_RMS = √(2E(w∗)/N),
where
E(w) = (1/2) ∑_{i=1}^{N} (w^T φ(x^(i)) − y^(i))^2.
i. [10 points] Regenerate the above chart with the provided data. To find the parameters, use the closed-form solution of linear regression (assuming all of its conditions are met) that minimizes the error for an M-degree polynomial (for M = 0, ..., 9) on the training data in q1xTrain.npy and q1yTrain.npy. For the test curve, use the data in q1xTest.npy and q1yTest.npy. (A code sketch follows item ii below.) [Note: for each value of M, we assume the feature vector is φ(x^(i)) = (1, x^(i), (x^(i))^2, ..., (x^(i))^M). The trend of your curve is not necessarily the same as the sample plot.]
ii. [5 points] Which degree polynomial would you say best fits the data? Was there evidence of under/over-fitting the data? Use your generated charts to justify your answer.
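As referenced in item i, here is a hedged sketch of the closed-form polynomial fits and the resulting E_RMS curves, assuming np.vander is used to build φ(x) and np.linalg.lstsq to solve the least-squares problem; the actual plotting call is omitted.

import numpy as np

x_tr, y_tr = np.load("q1xTrain.npy"), np.load("q1yTrain.npy")
x_te, y_te = np.load("q1xTest.npy"), np.load("q1yTest.npy")

def design(x, M):
    # phi(x) = (1, x, x^2, ..., x^M), one row per example
    return np.vander(x, M + 1, increasing=True)

def erms(w, x, y):
    # E_RMS = sqrt(2 E(w)/N) with E(w) = 0.5 * sum of squared residuals,
    # i.e. the square root of the mean squared residual
    resid = design(x, len(w) - 1) @ w - y
    return np.sqrt(np.mean(resid ** 2))

train_err, test_err = [], []
for M in range(10):
    Phi = design(x_tr, M)
    # closed-form least-squares fit; lstsq is more stable than inverting Phi^T Phi
    w, *_ = np.linalg.lstsq(Phi, y_tr, rcond=None)
    train_err.append(erms(w, x_tr, y_tr))
    test_err.append(erms(w, x_te, y_te))
# plot train_err and test_err against M to regenerate the chart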
(c) [12 points] Finally, you will explore the role of regularization. Recall the image from lecture that explored the effect of the regularization factor λ:
i. [10 points] Derive the closed-form solution of ridge regression, and find the coefficients that minimize the error for a ninth-degree polynomial (M = 9) for each regularization factor λ ∈ {0, 10^{-8}, 10^{-7}, 10^{-6}, 10^{-5}, ..., 10^{-1}, 10^{0} (= 1)}, using the training data in q1xTrain.npy and q1yTrain.npy. Then use these parameters to plot the E_RMS of both the training data and the test data as a function of λ and regenerate the above chart, using q1xTest.npy and q1yTest.npy as the test data. (A code sketch follows item ii below.) Specifically, use the regularized objective function
E(w) = (1/2) ∑_{i=1}^{N} (w^T φ(x^(i)) − y^(i))^2 + (λ/2) ∥w∥^2
for optimizing the parameters w, but please use the original (unregularized) E_RMS for plotting.
[Note: the trend of your curve is not necessarily the same as the sample chart.]
ii. [2 points] Which λ value seemed to work the best?
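As referenced in item i above, the following sketch shows one way to carry out the λ sweep, assuming the ridge closed form w = (Φ^T Φ + λI)^{-1} Φ^T y that item i asks you to derive (verify it against your own derivation); the printed values stand in for the E_RMS-versus-λ chart.

import numpy as np

x_tr, y_tr = np.load("q1xTrain.npy"), np.load("q1yTrain.npy")
x_te, y_te = np.load("q1xTest.npy"), np.load("q1yTest.npy")
M = 9

def design(x):
    return np.vander(x, M + 1, increasing=True)

def erms(w, x, y):
    # original (unregularized) E_RMS, as required for plotting
    resid = design(x) @ w - y
    return np.sqrt(np.mean(resid ** 2))

Phi = design(x_tr)
lambdas = [0.0] + [10.0 ** k for k in range(-8, 1)]
for lam in lambdas:
    # ridge closed form under the stated regularized objective
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ y_tr)
    print(lam, erms(w, x_tr, y_tr), erms(w, x_te, y_te))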
2 [36 points] Locally weighted linear regression
Consider a linear regression problem in which we want to weight different training examples differently. Specifically, suppose we want to minimize
E_D(w) = (1/2) ∑_{i=1}^{N} r^(i) (w^T x^(i) − y^(i))^2,
where r(i) is the weight for the sample (x(i), y(i)). In class, we worked out what happens for the case where all the weights (the r(i)’s) are the same. In this problem, we will generalize some of those ideas to the weighted setting, and also implement the locally weighted linear regression algorithm. [Note: the weight r(i) can be different for each training example.]
(a) [2 points] Show that E_D(w) can also be written as
E_D(w) = (Xw − y)^T R (Xw − y)
for an appropriate diagonal matrix R, and where X is a matrix whose i-th row is x(i) and y is the vector whose i-th entry is y(i). State clearly what R is.
(b) [8 points] If all the r(i)’s equal 1, then we saw in class that the normal equation is
X^T X w = X^T y,
and that the value of w∗ that minimizes E_D(w) is given by (X^T X)^{-1} X^T y. By finding the derivative ∇_w E_D(w) and setting it to zero, generalize the normal equation and the closed-form solution to this weighted setting, and give the new value of w∗ that minimizes E_D(w) in closed form as a function of X, R, and y.
(c) [8 points] Suppose we have a training set {(x(i), y(i)); i = 1, ..., N} of N independent examples, but in which the y(i)’s were observed with differing variances. Specifically, suppose that
p(y^(i) | x^(i); w) = (1 / (√(2π) σ^(i))) exp(−(y^(i) − w^T x^(i))^2 / (2(σ^(i))^2)).
I.e., y(i) has mean w^T x(i) and variance (σ^(i))^2 (where the σ(i)’s are fixed, known constants). Show that finding the maximum likelihood estimate of w reduces to solving a weighted linear regression problem.
State clearly what the r(i)’s are in terms of the σ(i)’s.
(d) [18 points] The following items will use the files q2x.npy which contains the inputs (x(i)) and q2y.npy which contains the outputs (y(i)) for a linear regression problem, with one training example per row.
i. [4 points] Implement (unweighted) linear regression (y = w^T x) on this dataset, using the closed-form solution we learned in lecture (remember to include the intercept term). Plot on the same figure the data (each data sample can be shown as a point (x(i), y(i)) in the figure) and the straight line resulting from your fit.
ii. [8 points] Implement locally weighted linear regression on this dataset (using the weighted normal equations you derived in part (b)), and plot on the same figure the data and the curve resulting from your fit. When evaluating local regression at a query point x (which is real-valued in this problem), use weights
r^(i) = exp(−(x − x^(i))^2 / (2τ^2)),
with a bandwidth parameter τ = 0.8. (Again, remember to include the intercept term.) A code sketch covering items i–iii follows item iii.
iii. [6 points] Repeat (ii) four times with τ = 0.1, 0.3, 2 and 10. Comment briefly on what happens to the fit when τ is too small or too large.
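As referenced in item ii, here is a hedged NumPy sketch covering items i–iii, assuming the weighted normal equations from part (b) take the form X^T R X w = X^T R y (verify this against your own derivation) and that q2x.npy and q2y.npy each hold one value per row; plotting calls are omitted.

import numpy as np

x = np.load("q2x.npy").ravel()
y = np.load("q2y.npy").ravel()
X = np.stack([np.ones_like(x), x], axis=1)         # include the intercept term

# (i) unweighted closed form: w = (X^T X)^(-1) X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# (ii) locally weighted prediction at a query point xq with bandwidth tau
def lwr_predict(xq, tau=0.8):
    r = np.exp(-(x - xq) ** 2 / (2.0 * tau ** 2))  # per-example weights r^(i)
    R = np.diag(r)
    w = np.linalg.solve(X.T @ R @ X, X.T @ R @ y)  # weighted normal equations
    return np.array([1.0, xq]) @ w

# evaluate the local fit on a grid of query points; repeat with tau in {0.1, 0.3, 2, 10} for (iii)
xs = np.linspace(x.min(), x.max(), 200)
lwr_curve = [lwr_predict(xq, tau=0.8) for xq in xs]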
3 [22 points] Derivation and Proof
(a) [8 points] Consider the linear regression problem for 1D data, where we would like to learn a function h(x) = w_1 x + w_0 with parameters w_0 and w_1 that minimizes the sum of squared errors ∑_{i=1}^{N} (y^(i) − h(x^(i)))^2 for N pairs of data samples (x^(i), y^(i)). Derive the solution for w_0 and w_1 for this 1D case of linear regression. Show the derivation to get the solution w_0 = Ȳ − w_1 X̄ and w_1 = ∑_{i=1}^{N} (x^(i) − X̄)(y^(i) − Ȳ) / ∑_{i=1}^{N} (x^(i) − X̄)^2, where X̄ is the mean of {x^(1), x^(2), ..., x^(N)} and Ȳ is the mean of {y^(1), y^(2), ..., y^(N)}.
(b) [14 points] Consider the definition and properties of a positive (semi-)definite matrix. Let A be a real, symmetric d × d matrix. A is positive semi-definite (PSD) if, for all z ∈ R^d, z^T A z ≥ 0. A is positive definite (PD) if, for all z ≠ 0, z^T A z > 0. We write A ⪰ 0 when A is PSD, and A ≻ 0 when A is PD. The spectral theorem says that every real symmetric matrix A can be expressed via the spectral decomposition A = UΛU^T, where U is a d × d matrix such that UU^T = U^T U = I and Λ = diag(λ_1, λ_2, ..., λ_d). Multiplying on the right by U, we see that AU = UΛ. If we let u_i denote the i-th column of U, we have Au_i = λ_i u_i for each i. Then the λ_i are the eigenvalues of A, and the corresponding columns u_i are the eigenvectors associated with λ_i. The eigenvalues constitute the “spectrum” of A, and the spectral decomposition is also called the eigenvalue decomposition of A.
i. [6 points] Prove that A is PD iff λ_i > 0 for each i.
ii. [8 points] Consider the linear regression problem where Φ and y are as defined in class and the closed-form solution is (Φ^T Φ)^{-1} Φ^T y. We can get the eigenvalues of the symmetric matrix Φ^T Φ using the spectral decomposition. We now apply ridge regression, so the symmetric matrix in the solution becomes Φ^T Φ + βI. Prove that ridge regression has the effect of shifting all singular values (which, for the symmetric PSD matrix Φ^T Φ, coincide with its eigenvalues) up by the constant β, and that for any β > 0, ridge regression therefore makes the matrix Φ^T Φ + βI PD.
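As a quick numerical sanity check of the claim in item ii (not a substitute for the proof), the sketch below compares the eigenvalues of Φ^T Φ and Φ^T Φ + βI for a randomly generated Φ; the shift by β and the resulting positive definiteness can be read off directly.

import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 5))
beta = 0.1

A = Phi.T @ Phi                                    # symmetric positive semi-definite
eig_A = np.linalg.eigvalsh(A)                      # eigenvalues in ascending order
eig_ridge = np.linalg.eigvalsh(A + beta * np.eye(5))

print(np.allclose(eig_ridge, eig_A + beta))        # every eigenvalue is shifted by beta
print(np.all(eig_ridge > 0))                       # hence A + beta*I is positive definite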