CS771 Homework 1: Intro to Machine Learning

Additional Instructions
We have provided a LaTeX template file hw1sol.tex to help typeset your PDF writeup. There is also a style file ml.sty that contains shortcuts for many useful LaTeX commands, such as boldfaced/calligraphic fonts for letters, various mathematical/Greek symbols, and others. Use of these shortcuts is recommended (but not necessary).
Your answer to every question should begin on a new page. The provided template is designed to do this automatically. However, if it fails to do so, use the \clearpage command in LaTeX before starting the answer to a new question to enforce a page break.
When submitting your assignment on the Gradescope website, you will have to specify on which page(s) question 1 is answered, on which page(s) question 2 is answered, and so on. To do this properly, first ensure that the answer to each question starts on a different page.
Be careful to flush all your floats (figures, tables) corresponding to question n before starting the answer to question n + 1; otherwise, while grading, we might miss important parts of your answers.
Your solutions must appear in proper order in the PDF file, i.e., the solution to question n must be complete in the PDF file (including all plots, tables, proofs, etc.) before you present the solution to question n + 1.
Problem 1 (15 marks)
(Absolute Loss Regression with Sparsity) The absolute loss regression problem with $\ell_1$ regularization is

\[
\mathbf{w}_{\text{opt}} = \arg\min_{\mathbf{w}} \sum_{n=1}^{N} \left| y_n - \mathbf{w}^\top \mathbf{x}_n \right| + \lambda \|\mathbf{w}\|_1
\]

where $|\cdot|$ denotes the absolute value function, and $\lambda > 0$ is the regularization hyperparameter.

Is the above objective function convex? You don’t need to prove this formally; just a brief reasoning based on properties of other functions that are known to be convex/non-convex would be fine.

Derive the expression for the (sub)gradient vector for this model.
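
If it helps while working on this, here is a small, optional numerical sketch (not part of the assignment) that evaluates the objective and one valid subgradient at a given w. The toy data, the value of λ, and the convention sign(0) = 0 at the non-differentiable points are assumptions made only for illustration.

import numpy as np

def objective(w, X, y, lam):
    # sum_n |y_n - w^T x_n| + lam * ||w||_1
    return np.sum(np.abs(y - X @ w)) + lam * np.sum(np.abs(w))

def subgradient(w, X, y, lam):
    # One valid subgradient, using the convention sign(0) = 0 at the kinks.
    r = y - X @ w                                  # residuals, shape (N,)
    return -X.T @ np.sign(r) + lam * np.sign(w)

# Toy data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)
w = np.zeros(5)
print(objective(w, X, y, lam=0.1))
print(subgradient(w, X, y, lam=0.1))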

Problem 2 (15 marks)
(Feature Masking as Regularization) Consider learning a linear regression model by minimizing the squared loss function. Suppose we decide to mask out or “drop” each feature xnd of each input xn ∈ RD, independently, with probability 1 − p (equivalently, retaining the feature with probability p). Masking or dropping out basically means that we set the feature xnd to 0 with probability 1 − p. Essentially, this is equivalent to replacing each input xn by x˜n = xn ◦ mn, where ◦ denotes the elementwise product and mn denotes the D × 1 binary mask vector with mnd ∼ Bernoulli(p) (mnd = 1 means the feature xnd was retained; mnd = 0 means the feature xnd was masked/zeroed).

Let us now define a new loss function using these masked inputs as follows: $\sum_{n=1}^{N} \left( y_n - \mathbf{w}^\top \tilde{\mathbf{x}}_n \right)^2$. Show that minimizing the expected value of this new loss function (where the expectation is taken since the mask vectors mn are random) is equivalent to minimizing a regularized loss function. Clearly write down the expression of this regularized loss function.
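
To make the masking mechanism concrete, here is a minimal, optional Monte Carlo sketch that estimates the expected masked loss by averaging over randomly drawn masks; you can compare its output numerically against whatever regularized loss you derive. The toy data, the retention probability p, and the number of mask samples are assumptions of this sketch.

import numpy as np

rng = np.random.default_rng(0)
N, D, p = 100, 5, 0.8                     # p = retention probability (toy choice)
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

def expected_masked_loss(w, X, y, p, n_masks=5000):
    # Monte Carlo estimate of E_m[ sum_n (y_n - w^T (x_n o m_n))^2 ],
    # with each mask entry m_nd ~ Bernoulli(p) drawn independently.
    total = 0.0
    for _ in range(n_masks):
        M = rng.binomial(1, p, size=X.shape)
        total += np.sum((y - (X * M) @ w) ** 2)
    return total / n_masks

print(expected_masked_loss(w, X, y, p))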

Problem 3 (40 marks)
(Multi-output Regression with Reduced Number of Parameters) Consider the multi-output regression setting in which each output yn ∈ RM is a real-valued vector, rather than a scalar. Assuming a linear model, we can model the outputs as Y ≈ XW, where X is the N × D feature matrix and Y is the N × M response matrix with row n being yn> (note that each column of Y denotes one of the M responses), and W is the D × M weight matrix, with its M columns containing the M weight vectors w1, w2, ..., wM. Let us define a squared error loss function, which is just the usual squared error but summed over all the M outputs.

First, verify that this loss can also be written in the more compact notation TRACE[(Y − XW)>(Y − XW)].
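
A quick numerical check of this identity, with randomly generated matrices chosen purely for illustration, might look like this:

import numpy as np

rng = np.random.default_rng(0)
N, D, M = 30, 6, 4
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, M))
Y = rng.normal(size=(N, M))

E = Y - X @ W
loss_summed = np.sum(E ** 2)          # squared error summed over all M outputs (and N examples)
loss_trace = np.trace(E.T @ E)        # TRACE[(Y - XW)^T (Y - XW)]
print(loss_summed, loss_trace)        # the two numbers agree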

Further, we will assume that the weight matrix W can be written as a product of two matrices, i.e., W = BS where B is D × K and S is K × M (assume K < min{D,M}). Note that there is a benefit of modeling W this way, since now we need to learn only K × (D + M) parameters as opposed to D × M parameters and, if K is small, this can significantly reduce the number of parameters (in fact, reducing the effective number of parameters to be learned is another way of regularizing a machine learning model). Note (you can verify) that in this formulation, each wm can be written as a linear combination of K columns of B.
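
For example (a purely illustrative choice of sizes), with D = 100, M = 50 and K = 5, the factored form needs K × (D + M) = 750 parameters instead of D × M = 5000.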

With the proposed representation of W, the new objective will be TRACE[(Y − XBS)>(Y − XBS)] and you need to learn both B and S by solving the following problem:

\[
\{\hat{\mathbf{B}}, \hat{\mathbf{S}}\} = \arg\min_{\mathbf{B}, \mathbf{S}} \; \text{TRACE}\left[ (\mathbf{Y} - \mathbf{X}\mathbf{B}\mathbf{S})^\top (\mathbf{Y} - \mathbf{X}\mathbf{B}\mathbf{S}) \right]
\]

We will ignore regularization on B and S for brevity/simplicity.

Derive an alternating optimization (ALT-OPT) algorithm to learn B and S, clearly writing down the expressions for the updates of B and S. Are both subproblems (solving for B and solving for S) equally easy/difficult in this ALT-OPT algorithm? If yes, why? If no, why not?

Note: Since B and S are matrices, feel free to use results for matrix derivatives (the results you will need can be found in Sec. 2.5 of the Matrix Cookbook). However, the problem can also be solved without matrix derivative results, with some rearrangement of terms and using vector derivatives.
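
If you want to experiment numerically before (or after) deriving the updates, one possible ALT-OPT skeleton is sketched below. It deliberately avoids writing any closed-form update and instead solves each subproblem with np.linalg.lstsq; the initialization, the fixed number of iterations, the toy sizes, and the vec/Kronecker trick used for the B-step are all assumptions of this sketch, not the required derivation.

import numpy as np

rng = np.random.default_rng(0)
N, D, M, K = 40, 8, 6, 3                      # toy sizes with K < min{D, M}
X = rng.normal(size=(N, D))
Y = rng.normal(size=(N, M))

B = rng.normal(size=(D, K))                   # arbitrary initialization
S = rng.normal(size=(K, M))

def loss(B, S):
    E = Y - X @ B @ S
    return np.trace(E.T @ E)

for it in range(50):
    # B fixed: regress the columns of Y on the transformed features XB.
    S, *_ = np.linalg.lstsq(X @ B, Y, rcond=None)

    # S fixed: use vec(XBS) = (S^T kron X) vec(B) and solve an ordinary least-squares problem.
    A = np.kron(S.T, X)                                            # shape (N*M, D*K)
    b_vec, *_ = np.linalg.lstsq(A, Y.flatten(order="F"), rcond=None)
    B = b_vec.reshape(D, K, order="F")

    if it % 10 == 0:
        print(it, loss(B, S))                                      # loss is non-increasing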

Problem 4 (10 marks)
(Ridge Regression using Newton’s Method) Consider the ridge regression problem:

\[
\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \; \frac{1}{2} \left[ (\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w}) + \lambda\, \mathbf{w}^\top \mathbf{w} \right]
\]

where X is the N × D feature matrix and y is the N × 1 vector of labels of the N training examples. Note that the factor of 1/2 has been used in the above expression just for convenience of the derivations required for this problem and does not change the solution to the problem.

Derive Newton’s method’s update equation for each iteration. For this model, how many iterations will Newton’s method take to converge?
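
As a purely numerical aid (it does not answer the derivation or the convergence question), the optional sketch below runs a generic Newton iteration on this objective, with the gradient and Hessian obtained by finite differences; the toy data, the λ value, the step count, and the finite-difference scheme are assumptions of this sketch.

import numpy as np

def num_grad(f, w, eps=1e-5):
    # Central-difference approximation of the gradient of f at w.
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

def num_hess(f, w, eps=1e-5):
    # Central-difference approximation of the Hessian (differences of gradients).
    D = w.size
    H = np.zeros((D, D))
    for i in range(D):
        e = np.zeros_like(w)
        e[i] = eps
        H[:, i] = (num_grad(f, w + e, eps) - num_grad(f, w - e, eps)) / (2 * eps)
    return H

# Toy ridge objective with the 1/2 factor from the problem; lam is an arbitrary choice.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
lam = 0.5
f = lambda w: 0.5 * ((y - X @ w) @ (y - X @ w) + lam * w @ w)

w = np.zeros(4)
for t in range(5):
    w = w - np.linalg.solve(num_hess(f, w), num_grad(f, w))   # Newton update
    print(t, f(w))                                            # watch how quickly this stabilizes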

Problem 5 (20 marks)
(Dice Roll) You have a six-faced die which you roll N times, recording the number of times each of its six faces is observed. Suppose these numbers are N1, N2, ..., N6, respectively. Assume the probability of a random roll of the die showing the kth face (k = 1, 2, ..., 6) to be πk ∈ (0,1).

Assuming an appropriate conjugate prior for the probability vector π = [π1, π2, ..., π6], derive its MAP estimate. In which situation(s) would you expect the MAP solution to be better than the MLE solution?

Also derive the full posterior distribution over π using the same prior that you used for the MAP estimate. Given this posterior, can you get the MLE and MAP estimates without solving the MLE and MAP optimization problems? If yes, how? If no, why not?
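
As an illustration of the setting only (the Dirichlet prior and the specific hyperparameter values below are assumptions made for this sketch; deriving and justifying the prior, the MAP estimate, and the posterior is the actual exercise), here is a short simulation of the counts and the corresponding conjugate posterior parameters:

import numpy as np

rng = np.random.default_rng(0)
true_pi = np.array([0.10, 0.10, 0.20, 0.20, 0.20, 0.20])   # hypothetical die probabilities
rolls = rng.choice(6, size=100, p=true_pi)                  # N = 100 rolls
counts = np.bincount(rolls, minlength=6)                    # N_1, ..., N_6

alpha = np.full(6, 2.0)              # Dirichlet prior hyperparameters (a toy choice)
post_alpha = alpha + counts          # Dirichlet-multinomial conjugacy: posterior parameters

print("counts           :", counts)
print("MLE (frequencies):", counts / counts.sum())
print("posterior Dirichlet parameters:", post_alpha)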
