CS 189 Introduction to Machine Learning HW4
Note that you can submit models to Kaggle only twice a day!
1. Kaggle: Submit your predictions to
https://www.kaggle.com/c/spring21-cs189-hw4-wine
2. Write-up: Submit your solution in PDF format to “Homework 4 Write-Up” on Gradescope.
• State your name, and if you have discussed this homework with anyone (other than GSIs), list all their names.
• Begin the solution for each question on a new page. Do not put content for different questions on the same page. You may use multiple pages for a question if required.
• If you include figures, graphs, or tables for a question, any explanations should accompany them on the same page. Do NOT put these in an appendix!
• Only PDF uploads to Gradescope will be accepted. You may use LaTeX or Word to typeset your solution, or scan a neatly handwritten solution to produce the PDF.
• Replicate all your code in an appendix. Begin the code for each coding question on a fresh page. Do not put code from multiple questions on the same page. When you upload this PDF to Gradescope, make sure that you assign the relevant pages of your code from the appendix to the correct questions.
3. Code: Additionally, submit all your code as a ZIP to “Homework 4 Code” on Gradescope.
• Set a seed for all pseudo-random numbers generated in your code. This ensures your results are replicated when readers run your code.
• Include a README with your name, student ID, the values of the random seed (above) you used, and any instructions for compilation.
• Do NOT provide any data files, but supply instructions on how to add data to your code.
• Code requiring exorbitant memory or execution time won’t be considered.
• Code submitted here must match that in the PDF Write-up, and produce the exact output submitted to Kaggle. Inconsistent or incomplete code won’t be accepted.
Notation. In this assignment we use the following conventions.
• The symbol “defined equal to” (≜) defines the quantity to its left to be the expression to its right.
• Scalars are lowercase non-bold: x, u1, αi. Matrices are uppercase letters: A, B1, Ci. Vectors (column vectors) are in bold: x, α1, X, Yj.
• ‖v‖ denotes the Euclidean norm (length) of vector v: ‖v‖ ≜ √(v · v). ‖A‖ denotes the (operator) norm of matrix A, the magnitude of its largest singular value: ‖A‖ = max_{‖v‖=1} ‖Av‖.
• [n] ≜ {1, 2, 3, ..., n}. 1 and 0 denote the vectors with all-ones and all-zeros, respectively.
1 Honor Code
Declare and sign the following statement:
“I certify that all solutions in this document are entirely my own and that I have not looked at anyone else’s solution. I have given credit to all external sources I consulted.” Signature :
While discussions are encouraged, everything in your solution must be your (and only your) creation. Furthermore, all external material (i.e., anything outside lectures and assigned readings, including figures and pictures) should be cited properly. We wish to remind you that consequences of academic misconduct are particularly severe!
2 Logistic Regression with Newton’s Method
Given examples x1, x2, ..., xn ∈ R^d and associated labels y1, y2, ..., yn ∈ {0, 1}, the cost function for unregularized logistic regression is
J(w) ≜ −∑_{i=1}^{n} [ y_i ln s_i + (1 − y_i) ln(1 − s_i) ],
where s_i ≜ s(x_i · w), w ∈ R^d is a weight vector, and s(γ) ≜ 1/(1 + e^{−γ}) is the logistic function.
Define the n × d design matrix X (whose ith row is x_i^T), the label n-vector y ≜ [y1 ... yn]^T, and s ≜ [s1 ... sn]^T. For an n-vector a, let ln a ≜ [ln a1 ... ln an]^T. The cost function can be rewritten in vector form as J(w) = −y · ln s − (1 − y) · ln(1 − s).
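For concreteness, here is a minimal Python sketch of evaluating s and J(w) in the vectorized form above (the array names X, y, w and the toy values are purely illustrative; this is not part of any required derivation):

```python
import numpy as np
from scipy.special import expit  # logistic function s(gamma) = 1 / (1 + e^{-gamma})

def cost(w, X, y):
    """Evaluate J(w) = -y . ln(s) - (1 - y) . ln(1 - s), where s = s(Xw)."""
    s = expit(X @ w)
    return -(y @ np.log(s) + (1 - y) @ np.log(1 - s))

# Toy usage: two of the sample points from question 4 below.
X = np.array([[0.2, 3.1], [1.0, 3.0]])
y = np.array([1.0, 1.0])
print(cost(np.zeros(2), X, y))  # 2 ln 2, since every s_i = 0.5 when w = 0
```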
Hint: Recall the matrix calculus identities for the gradient ∇_x(x · z), the chain rule for ∇_x f(y), and the derivative of Cx with respect to x (where C is a constant matrix).
1 Derive the gradient ∇_w J(w) of the cost J(w) as a matrix-vector expression. Also derive all intermediate derivatives in matrix-vector form. Do NOT specify them in terms of their individual components.
2 Derive the Hessian ∇²_w J(w) of the cost function J(w) as a matrix-vector expression.
3 Write the matrix-vector update law for one iteration of Newton’s method, substituting the gradient and Hessian of J(w).
4 You are given four examples x1 = [0.2, 3.1], x2 = [1.0, 3.0], x3 = [−0.2, 1.2], x4 = [1.0, 1.1] with labels y1 = 1, y2 = 1, y3 = 0, y4 = 0. These points cannot be separated by a line passing through the origin. Hence, as described in lecture, append a 1 to each x_i, i ∈ [4], and use a weight vector w ∈ R^3 whose last component is the bias term (called α in lecture). Begin with the initial weight w^(0) = [−1 1 0]^T. For the following, state only the final answer with four digits after the decimal point. You may use a calculator or write a program to solve for these (see the generic Newton-iteration sketch after this question), but do NOT submit any code for this part.
(a) State the value of s^(0) (the initial value of s).
(b) State the value of w^(1) (the value of w after one iteration).
(c) State the value of s^(1) (the value of s after one iteration).
(d) State the value of w^(2) (the value of w after two iterations).
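If you choose to write a program for question 4 above, a generic Newton-iteration loop is all that is needed once you have derived the gradient and Hessian. Below is a minimal Python sketch under that assumption; the grad and hess callables and the quadratic example are illustrative stand-ins, not the logistic-regression derivatives you are asked to derive:

```python
import numpy as np

def newtons_method(grad, hess, w0, num_iters=2):
    """Run num_iters Newton steps from w0.

    grad(w) must return the gradient vector and hess(w) the Hessian matrix
    of the cost being minimized; supplying them is left to you.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        # Newton step: solve H(w) d = grad(w), then move by -d.
        d = np.linalg.solve(hess(w), grad(w))
        w = w - d
    return w

# Illustrative quadratic cost f(w) = 0.5 w^T A w - b^T w (NOT the homework cost):
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
print(newtons_method(lambda w: A @ w - b, lambda w: A, w0=[0.0, 0.0]))
# A quadratic converges to A^{-1} b in a single Newton step.
```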
3 Wine Classification with Logistic Regression
The wine dataset data.mat consists of 6,497 sample points, each having 12 features. The description of these features is provided in data.mat. The dataset includes a training set of 6,000 sample points and a test set of 497 sample points. Your classifier needs to predict whether a wine is white (class label 0) or red (class label 1).
Begin by normalizing the data with each feature’s mean and standard deviation. You should use the training-data statistics to normalize both the training and validation/test data. Then add a fictitious dimension. Whenever required, it is recommended that you tune hyperparameter values with cross-validation.
Please set a random seed whenever needed and report it.
Use of automatic logistic regression libraries/packages is prohibited for this question. If you are coding in Python, it is better to use scipy.special.expit to evaluate the logistic function, as it is numerically stable and doesn’t produce NaN values or overflow exceptions.
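As a concrete illustration of the preprocessing described above, here is a minimal Python sketch. The array names X_train and X_test and the seed value are placeholders, not values prescribed by the assignment:

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic function

np.random.seed(189)  # set a seed wherever randomness is used, and report it

def normalize_and_add_bias(X_train, X_test):
    """Standardize with training-set statistics, then append a fictitious dimension."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    X_train = (X_train - mu) / sigma
    X_test = (X_test - mu) / sigma  # reuse the training statistics here
    X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
    X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
    return X_train, X_test

# expit(z) computes 1 / (1 + exp(-z)) without overflowing for large |z|.
```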
1 Batch Gradient Descent Update. State the batch gradient descent update law for logistic regression with ℓ2 regularization. As this is a “batch” algorithm, each iteration should use every training example. You don’t have to show your derivation. You may reuse results from your solution to question 2.1.
2 Batch Gradient Descent Code. Choose reasonable values for the regularization parameter and step size (learning rate), specify your chosen values, and train your model from question 3.1. Plot the value of the cost function versus the number of iterations spent in training. (A generic training-loop sketch for parts 2, 4, and 5 appears at the end of this section.)
3 Stochastic Gradient Descent (SGD) Update. State the SGD update law for logistic regression with ℓ2 regularization. Since this is not a “batch” algorithm anymore, each iteration uses just one training example. You don’t have to show your derivation.
4 Stochastic Gradient Descent Code. Choose a suitable value for the step size (learning rate), specify your chosen value, and run your SGD algorithm from question 3.3. Plot the value of the cost function versus the number of iterations spent in training.
Compare your plot here with that of question 3.2. Which method converges more quickly? Briefly describe what you observe.
5 Instead of using a constant step size (learning rate) in SGD, you could use a step size that slowly shrinks from iteration to iteration. Run your SGD algorithm from question 3.3 with a step size ε_t = δ/t, where t is the iteration number and δ is a hyperparameter you select empirically. Mention the value of δ chosen. Plot the value of the cost function versus the number of iterations spent in training.
How does this compare to the convergence of your previous SGD code?
6 Kaggle. Train your best classifier on the entire training set and submit your prediction on the test sample points to Kaggle. As always for Kaggle competitions, you are welcome to add or remove features, tweak the algorithm, and do pretty much anything you want to improve your Kaggle leaderboard performance except that you may not replace or augment logistic regression with a wholly different learning algorithm. Your code should output the predicted labels in a CSV file.
Report your Kaggle username and your best score, and briefly describe what your best classifier does to achieve that score.
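To illustrate the difference between the batch and stochastic update schemes (and the decaying step size of part 5), here is a minimal Python sketch of the training loops only. It assumes you plug in the regularized gradient you derived; the least-squares gradient shown is an illustrative stand-in, not the logistic-regression gradient the questions ask for:

```python
import numpy as np

rng = np.random.default_rng(189)  # report whatever seed you actually use

def batch_gd(X, y, grad, step=1e-3, iters=1000):
    """Batch gradient descent: every iteration uses all training examples."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = w - step * grad(w, X, y)
    return w

def sgd(X, y, grad, delta=1.0, iters=5000, decay=True):
    """SGD: each iteration uses one randomly chosen example.

    With decay=True the step size shrinks as delta / t (part 5); with
    decay=False it stays constant at delta (part 4).
    """
    w = np.zeros(X.shape[1])
    n = X.shape[0]
    for t in range(1, iters + 1):
        i = rng.integers(n)
        step = delta / t if decay else delta
        w = w - step * grad(w, X[i:i+1], y[i:i+1])
    return w

# Illustrative stand-in gradient (ridge-regularized least squares), NOT the answer
# to questions 3.1/3.3:
def lstsq_grad(w, X, y, lam=0.1):
    return 2 * X.T @ (X @ w - y) + 2 * lam * w
```

Tracking the cost value inside each loop (for example, appending it to a list every iteration) gives the data for the required cost-versus-iterations plots.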
4 A Bayesian Interpretation of Lasso
Suppose you are aware that the labels y_{i∈[n]} corresponding to sample points x_{i∈[n]} ∈ R^d follow the density law
f(y | x, w) ≜ (1/(σ√(2π))) e^{−(y − w·x)²/(2σ²)},
where σ > 0 is a known constant and w ∈ R^d is a random parameter. Suppose further that experts have told you that
• each component of w is independent of the others, and
• each component of w has the Laplace distribution with location 0 and scale a known constant b; that is, each component w_i obeys the density law f(w_i) = e^{−|w_i|/b}/(2b).
Assume the outputs y_{i∈[n]} are independent of each other.
Your goal is to find the choice of parameter w that is most likely given the input-output examples (xi,yi)i∈[n]. This method of estimating parameters is called maximum a posteriori (MAP); Latin for “maximum [odds] from what follows.”
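For orientation, part 1 below asks you to apply Bayes’ Theorem; up to a normalizing constant that does not depend on w, and using the independence of the outputs, the identity being invoked is the following (plugging in the Gaussian likelihood and the Laplace prior is the part left to you):

```latex
f\bigl(\mathbf{w} \mid (\mathbf{x}_i, y_i)_{i \in [n]}\bigr)
  \;\propto\; f\bigl((y_i)_{i \in [n]} \mid (\mathbf{x}_i)_{i \in [n]}, \mathbf{w}\bigr)\, f(\mathbf{w})
  \;=\; \Bigl(\prod_{i=1}^{n} f(y_i \mid \mathbf{x}_i, \mathbf{w})\Bigr) f(\mathbf{w}).
```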
1. Derive the posterior probability density law f(w | (x_i, y_i)_{i∈[n]}) for w up to a proportionality constant by applying Bayes’ Theorem and using the densities f(y_i | x_i, w) and f(w). Don’t try to derive an exact expression for f(w | (x_i, y_i)_{i∈[n]}), as it is very involved.
2. Define the log-likelihood for MAP as ℓ(w) ≜ ln f(w | x_{i∈[n]}, y_{i∈[n]}). Show that maximizing the MAP log-likelihood over all choices of w is the same as minimizing ∑_{i=1}^{n} (y_i − w·x_i)² + λ‖w‖₁, where ‖w‖₁ = ∑_{j=1}^{d} |w_j| and λ is a constant.
5 ℓ1-regularization, ℓ2-regularization, and Sparsity
You are given a design matrix X (whose ith row is sample point x_i^T) and an n-vector of labels y ≜ [y1 ... yn]^T. For simplicity, assume X is whitened, so X^T X = nI. Do not add a fictitious dimension/bias term; for input 0, the output is always 0. Let x_{∗i} denote the ith column of X.
1. The ℓp-norm for w ∈ R^d is defined as ‖w‖_p = (∑_{i=1}^{d} |w_i|^p)^{1/p}, where p > 0. Plot the isocontours for w ∈ R², for the following (see the plotting sketch after this part):
(a) ℓ0.5 (b) ℓ1 (c) ℓ2
Use of automatic libraries/packages for computing norms is prohibited for this question.
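Here is a minimal matplotlib sketch for parts (a)–(c) above, computing |w1|^p + |w2|^p directly rather than calling any norm routine; the grid bounds and contour levels are arbitrary illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid of points in the (w1, w2) plane.
w1, w2 = np.meshgrid(np.linspace(-2, 2, 401), np.linspace(-2, 2, 401))

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, p in zip(axes, [0.5, 1.0, 2.0]):
    # Compute (|w1|^p + |w2|^p)^(1/p) by hand -- no norm library calls.
    val = (np.abs(w1) ** p + np.abs(w2) ** p) ** (1.0 / p)
    ax.contour(w1, w2, val, levels=[0.5, 1.0, 1.5])
    ax.set_title(f"l_{p} isocontours")
    ax.set_aspect("equal")
plt.show()
```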
2. Show that the cost function for ℓ1-regularized least squares, J1(w) ≜ ‖Xw − y‖² + λ‖w‖₁ (where λ > 0), can be rewritten as J1(w) = ‖y‖² + ∑_{i=1}^{d} f(x_{∗i}, w_i), where f(·, ·) is a suitable function whose first argument is a vector and second argument is a scalar.
3. Using your solution to part 2, derive necessary and sufficient conditions for the ith component of the optimizer w∗ of J1(·) to satisfy each of these three properties: w∗_i > 0, w∗_i = 0, and w∗_i < 0.
4. For the optimizer w# of the ℓ2-regularized least squares cost function J2(w) ≜ ‖Xw − y‖² + λ‖w‖² (where λ > 0), derive a necessary and sufficient condition for w#_i = 0, where w#_i is the ith component of w#.
5. A vector is called sparse if most of its components are 0. From your solutions to parts 3 and 4, which of w∗ and w# is more likely to be sparse? Why?