Homework 3
Linear Regression
1. Consider a noisy target y = w_f^T x + ε, where x ∈ R^{d+1} (including the added coordinate x_0 = 1), y ∈ R, w_f ∈ R^{d+1} is an unknown vector, and ε is an i.i.d. noise term with zero mean and variance σ². Assume that we run linear regression on a training data set D = {(x_1, y_1), ..., (x_N, y_N)} generated i.i.d. from some P(x) and the noise process above, and obtain the weight vector w_lin. As briefly discussed in Lecture 9, it can be shown that the expected in-sample error E_in(w_lin) with respect to D is given by
E_D[E_in(w_lin)] = σ² (1 − (d+1)/N).
For σ = 0.1 and d = 11, what is the smallest number of examples N such that ED [Ein(wlin)] is no less than 0.006? Choose the correct answer; explain your answer.
[a] 25
[b] 30
[c] 35
[d] 40
[e] 45
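For a quick arithmetic check of Problem 1, here is a minimal sketch that searches for the smallest N under the expected-error formula above (the variable names are ours):

sigma, d, target = 0.1, 11, 0.006
N = d + 2                                     # the formula needs N > d + 1
while sigma ** 2 * (1 - (d + 1) / N) < target:
    N += 1
print(N)                                      # smallest N with E_D[E_in] >= 0.006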
2. As shown in Lecture 9, minimizing E_in(w) for linear regression means solving ∇E_in(w) = 0, which in turn means solving the so-called normal equation
X^T X w = X^T y.
Which of the following statements about the normal equation is correct for any features X and labels y? Choose the correct answer; explain your answer.
[a] There exists at least one solution for the normal equation.
[b] If there exists a solution for the normal equation, Ein(w) = 0 at such a solution.
[c] If there exists a unique solution for the normal equation, Ein(w) = 0 at the solution.
[d] If Ein(w) = 0 at some w, there exists a unique solution for the normal equation.
[e] none of the other choices
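As a quick numerical illustration (a sketch, not a proof): for any X, even one with linearly dependent columns, a least-squares solution returned by numpy satisfies the normal equation.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
X[:, 4] = X[:, 3]                            # make X^T X singular on purpose
y = rng.standard_normal(20)
w, *_ = np.linalg.lstsq(X, y, rcond=None)    # one least-squares solution
print(np.allclose(X.T @ X @ w, X.T @ y))     # the normal equation still holds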
3. In Lecture 9, we introduced the hat matrix H = XX† for linear regression. The matrix projects the label vector y to the “predicted” vector yˆ = Hy and helps us analyze the error of linear regression. Assume that XTX is invertible, which makes H = X(XTX)−1XT. Now, consider the following operations on X. Which operation can possibly change H? Choose the correct answer; explain your answer.
[a] multiplying the whole matrix X by 2 (which is equivalent to scaling all input vectors by 2)
[b] multiplying the i-th column of X by i, for each i (which is equivalent to scaling the i-th feature by i)
[c] multiplying the n-th row of X by n, for each n (which is equivalent to scaling the n-th example by n)
[d] adding three randomly-chosen columns i,j,k to column 1 of X
(i.e., xn,1 ← xn,1 + xn,i + xn,j + xn,k)
[e] none of the other choices (i.e. all other choices are guaranteed to keep H unchanged.)
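Each operation is easy to test numerically; a sketch (with an arbitrary random X) that compares H before and after operations [a]-[c]:

import numpy as np

def hat(X):
    return X @ np.linalg.inv(X.T @ X) @ X.T

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 3))
H = hat(X)
print(np.allclose(H, hat(2 * X)))                          # operation [a]
print(np.allclose(H, hat(X * np.arange(1, 4))))            # operation [b]: scale column i by i
print(np.allclose(H, hat(X * np.arange(1, 9)[:, None])))   # operation [c]: scale row n by n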
Likelihood and Maximum Likelihood
4. Consider a coin with an unknown head probability θ. Independently flip this coin N times to get y_1, y_2, ..., y_N, where y_n = 1 if the n-th flip results in head, and 0 otherwise. Define ν = (1/N) Σ_{n=1}^{N} y_n. How many of the following statements about ν are true? Choose the correct answer; explain your answer by illustrating why those statements are true.
• Pr(|ν − θ| > ε) ≤ 2 exp(−2ε²N) for all N ∈ N and ε > 0.
• ν maximizes likelihood(θ̂) over all θ̂ ∈ [0, 1].
• ν minimizes E_in(ŷ) = (1/N) Σ_{n=1}^{N} (ŷ − y_n)² over all ŷ ∈ R.
• 2 · ν is the negative gradient direction −∇E_in(ŷ) at ŷ = 0.
(Note: θ is similar to the role of the “target function” and θˆ is similar to the role of the “hypothesis” in our machine learning framework.)
[a] 0
[b] 1
[c] 2
[d] 3
[e] 4
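To build intuition for the likelihood statement, a small sketch (under the Bernoulli model above, with an arbitrary true θ = 0.3) that compares ν with the grid maximizer of the likelihood:

import numpy as np

rng = np.random.default_rng(2)
y = rng.binomial(1, 0.3, size=50)                 # N = 50 flips with theta = 0.3
nu = y.mean()
grid = np.linspace(1e-6, 1 - 1e-6, 100001)
loglik = y.sum() * np.log(grid) + (len(y) - y.sum()) * np.log(1 - grid)
print(nu, grid[np.argmax(loglik)])                # the two values should (nearly) agree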
5. Let y_1, y_2, ..., y_N be N values generated i.i.d. from a uniform distribution over [0, θ] with some unknown θ. For any θ̂ ≥ max(y_1, y_2, ..., y_N), what is its likelihood? Choose the correct answer; explain your answer.
(Hint: Those who are interested in more math [who isn’t? :-)] are encouraged to try to derive the maximum-likelihood estimator.)
Gradient and Stochastic Gradient Descent
6. In the perceptron learning algorithm, we find one example (x_{n(t)}, y_{n(t)}) that the current weight vector w_t mis-classifies, and then update w_t by
w_{t+1} ← w_t + y_{n(t)} x_{n(t)}.
A variant of the algorithm finds all examples (x_n, y_n) that the weight vector w_t mis-classifies (i.e., y_n ≠ sign(w_t^T x_n)), and then updates w_t by
w_{t+1} ← w_t + Σ_{n: y_n ≠ sign(w_t^T x_n)} y_n x_n.
The variant can be viewed as optimizing some E_in(w) that is composed of one of the following pointwise error functions, using gradient descent with a fixed learning rate (neglecting any non-differentiable points of E_in). What is the error function? Choose the correct answer; explain your answer.
[a] err(w, x, y) = |1 − y w^T x|
[b] err(w, x, y) = max(0, −y w^T x)
[c] err(w, x, y) = −y w^T x
[d] err(w, x, y) = min(0, −y w^T x)
[e] err(w, x, y) = max(0, 1 − y w^T x)
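For reference, one update step of the variant itself can be written as follows (a sketch; X is the N×(d+1) data matrix and y the label vector, both assumed given):

import numpy as np

def pla_variant_step(w, X, y):
    mis = np.sign(X @ w) != y                            # all examples w mis-classifies
    return w + (y[mis][:, None] * X[mis]).sum(axis=0)    # add y_n x_n over all of them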
7. Besides the error functions introduced in the lectures so far, the following error function, the exponential error, is also widely used by some learning models. The exponential error is defined by err_exp(w, x, y) = exp(−y w^T x). If we want to use stochastic gradient descent to minimize an E_in(w) that is composed of this error function, which of the following is the update direction −∇err_exp(w, x_n, y_n) for the chosen (x_n, y_n) with respect to w_t? Choose the correct answer; explain your answer.
[a] +y_n x_n exp(−y_n w^T x_n)
[b] −y_n x_n exp(−y_n w^T x_n)
[c] +x_n exp(−y_n w^T x_n)
[d] −x_n exp(−y_n w^T x_n)
[e] none of the other choices
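A candidate answer can be sanity-checked against a finite-difference gradient of err_exp; a sketch with arbitrary random (w, x, y):

import numpy as np

def err_exp(w, x, y):
    return np.exp(-y * (w @ x))

rng = np.random.default_rng(3)
w, x, y = rng.standard_normal(5), rng.standard_normal(5), 1.0
h = 1e-6
num_grad = np.array([(err_exp(w + h * e, x, y) - err_exp(w - h * e, x, y)) / (2 * h)
                     for e in np.eye(5)])
print(-num_grad)                                  # compare with each candidate direction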
Hessian and Newton Method
8. Let E(w): R^d → R be a function. Denote the gradient b_E(w) and the Hessian A_E(w) by
(b_E(w))_i = ∂E(w)/∂w_i and (A_E(w))_{i,j} = ∂²E(w)/(∂w_i ∂w_j).
Then, the second-order Taylor expansion of E(w) around u is
E(w) ≈ E(u) + b_E(u)^T (w − u) + (1/2) (w − u)^T A_E(u) (w − u).
Suppose A_E(u) is positive definite. What is the optimal direction v such that w ← u + v minimizes the right-hand side of the Taylor expansion above? Choose the correct answer; explain your answer. (Note that iterative optimization with this v is generally called Newton's method.)
[a] +(A_E(u))^{-1} b_E(u)
[b] −(A_E(u))^{-1} b_E(u)
[c] +(A_E(u))^{+1} b_E(u)
[d] −(A_E(u))^{+1} b_E(u)
[e] none of the other choices
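The four candidates can be compared directly on a random positive definite instance; a sketch (A plays the role of A_E(u), b the role of b_E(u)):

import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)                       # a positive definite A_E(u)
b = rng.standard_normal(4)                        # a b_E(u)
quad = lambda v: b @ v + 0.5 * v @ A @ v          # v-dependent part of the expansion
for name, v in [("[a]", np.linalg.solve(A, b)), ("[b]", -np.linalg.solve(A, b)),
                ("[c]", A @ b), ("[d]", -(A @ b))]:
    print(name, quad(v))                          # the minimizer gives the smallest value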
9. Following the previous problem, consider minimizing E_in(w) in the linear regression problem with Newton's method. For any given w_t, what is the Hessian A_E(w_t) with E = E_in? Choose the correct answer; explain your answer.
[b] (2/N) X^T X
[e] none of the other choices
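A finite-difference check of any candidate Hessian is straightforward, starting from the gradient of the averaged squared error; a sketch with random data:

import numpy as np

rng = np.random.default_rng(5)
X, y = rng.standard_normal((30, 4)), rng.standard_normal(30)
N = len(y)
grad = lambda w: 2 / N * X.T @ (X @ w - y)        # gradient of E_in(w) = (1/N)||Xw - y||^2
w0, h = rng.standard_normal(4), 1e-6
H_num = np.column_stack([(grad(w0 + h * e) - grad(w0 - h * e)) / (2 * h) for e in np.eye(4)])
print(H_num)                                      # compare with each candidate Hessian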
Multinomial Logistic Regression
10. In Lecture 11, we solve multiclass classification by OVA or OVO decompositions. One alternative to deal with multiclass classification is to extend the original logistic regression model to Multinomial Logistic Regression (MLR). For a K-class classification problem, we will denote the output space
Y = {1, 2, ..., K}. The hypotheses considered by MLR can be indexed by a matrix
W = [w_1 w_2 ... w_K]
that contains K weight vectors (w_1, ..., w_K), each of length d+1. The matrix represents a hypothesis
h_y(x) = exp(w_y^T x) / Σ_{k=1}^{K} exp(w_k^T x)
that can be used to approximate the target distribution P(y|x) for any (x, y). MLR then seeks the maximum-likelihood solution over all such hypotheses. For a given data set {(x_1, y_1), ..., (x_N, y_N)} generated i.i.d. from some P(x) and target distribution P(y|x), the likelihood of h is proportional to Π_{n=1}^{N} h_{y_n}(x_n). That is, minimizing the negative log likelihood is equivalent to minimizing an E_in(W) that is composed of the following error function
err(W, x, y) = −ln h_y(x) = −Σ_{k=1}^{K} [[y = k]] ln h_k(x),
where [[·]] is 1 when the condition inside holds and 0 otherwise.
When minimizing E_in(W) with SGD, we need to compute the partial derivative ∂ err(W, x, y) / ∂W_{ik}. What is the value of the partial derivative? Choose the correct answer; explain your answer.
[e] none of the other choices
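As in Problem 7, a finite-difference check is a convenient way to verify a candidate derivative; a sketch assuming the softmax form of h_k(x) given above:

import numpy as np

def err_mlr(W, x, y):
    s = W.T @ x                                   # scores w_k^T x for the K classes
    return -(s[y] - np.log(np.exp(s).sum()))      # -ln h_y(x)

rng = np.random.default_rng(6)
d, K = 3, 4
W, x, y = rng.standard_normal((d + 1, K)), rng.standard_normal(d + 1), 2
i, k, h = 1, 2, 1e-6
E = np.zeros((d + 1, K)); E[i, k] = h
print((err_mlr(W + E, x, y) - err_mlr(W - E, x, y)) / (2 * h))   # numerical d err / d W_ik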
11. Following the previous problem, consider a data set with K = 2 and obtain the optimal solution from MLR as (w_1^*, w_2^*). Now, relabel the same data set by replacing y_n with 3 − 2y_n to form a binary classification data set. Which of the following is an optimal solution for logistic regression on the binary classification data set? Choose the correct answer; explain your answer.
Nonlinear Transformation
12. Given the following training data set:
x1 = (0,1),y1 = −1 x2 = (1,−0.5),y2 = −1 x3 = (−1,0),y3 = −1
x4 = (−1,2),y4 = +1 x5 = (2,0),y5 = +1 x6 = (1,−1.5),y6 = +1 x7 = (0,−2),y7 = +1
Using the quadratic transform Φ_2(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2), which of the following weights w̃^T in the Z-space can separate all of the training data correctly? Choose the correct answer; (no, you don't need to explain your answer :-)).
[a] [−9,−1,0,2,−2,3]
[b] [−5,−1,2,3,−7,2]
[c] [9,−1,4,2,−2,3]
[d] [2,1,−4,−2,7,−4]
[e] [−7,0,0,2,−2,3]
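Even though no explanation is required, the check is easy to script; a sketch assuming the transform order (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2) given above:

import numpy as np

X = np.array([[0, 1], [1, -0.5], [-1, 0], [-1, 2], [2, 0], [1, -1.5], [0, -2]])
y = np.array([-1, -1, -1, +1, +1, +1, +1])
Z = np.column_stack([np.ones(len(X)), X[:, 0], X[:, 1],
                     X[:, 0] ** 2, X[:, 0] * X[:, 1], X[:, 1] ** 2])
for w in ([-9, -1, 0, 2, -2, 3], [-5, -1, 2, 3, -7, 2], [9, -1, 4, 2, -2, 3],
          [2, 1, -4, -2, 7, -4], [-7, 0, 0, 2, -2, 3]):
    print(w, bool(np.all(np.sign(Z @ np.array(w)) == y)))   # True iff w separates the data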
13. Consider the following feature transform, which maps x ∈ R^d to z ∈ R^{1+1}, keeping only the k-th coordinate of x: Φ_(k)(x) = (1, x_k). Let H_k be the set of hypotheses that couples Φ_(k) with perceptrons. Among the following choices, which is the tightest upper bound of d_vc(∪_{k=1}^{d} H_k) for d ≥ 4? Choose the correct answer; explain your answer. (Hint: you can use a fact that holds for d ≥ 4 if needed.)
[a] 2((log2 log2 d) + 1)
[b] 2((log2 d) + 1)
[c] 2((dlog2 d) + 1)
[d] 2(d + 1)
[e] 2(d2 + 1)
Experiments with Linear and Nonlinear Models
Next, we will play with linear regression, logistic regression, non-linear transform, and their use for binary classification. Please use the following set for training:
https://www.csie.ntu.edu.tw/~htlin/course/ml20fall/hw3/hw3_train.dat
and the following set for testing (estimating Eout):
https://www.csie.ntu.edu.tw/~htlin/course/ml20fall/hw3/hw3_test.dat
Each line of the data set contains one (x_n, y_n) with x_n ∈ R^10. The first 10 numbers of the line are the components of x_n in order; the last number is y_n, which belongs to {−1, +1} ⊆ R. That is, we can use those y_n for either binary classification or regression.
14. (*) Add x_{n,0} = 1 to each x_n. Then, implement the linear regression algorithm on page 11 of Lecture 9. What is E_in^sqr(w_lin), where E_in^sqr denotes the averaged squared error over the N examples? Choose the closest answer; provide your code.
[a] 0.00
[b] 0.20
[c] 0.40
[d] 0.60
[e] 0.80
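A minimal sketch of the required procedure (file name taken from the training link above; w_lin computed with the pseudo-inverse as on page 11 of Lecture 9):

import numpy as np

data = np.loadtxt("hw3_train.dat")                       # one (x_n, y_n) per line
X = np.column_stack([np.ones(len(data)), data[:, :10]])  # add x_{n,0} = 1
y = data[:, 10]
w_lin = np.linalg.pinv(X) @ y                            # w_lin = X^dagger y
print(np.mean((X @ w_lin - y) ** 2))                     # E_in^sqr(w_lin)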
15. (*) Add x_{n,0} = 1 to each x_n. Then, implement the SGD algorithm for linear regression using the results on pages 10 and 12 of Lecture 11. Pick one example uniformly at random in each iteration, take η = 0.001, and initialize w with w_0 = 0. Run the algorithm until E_in^sqr(w_t) ≤ 1.01 E_in^sqr(w_lin), and record the total number of iterations taken. Repeat the experiment 1000 times, each with a different random seed. What is the average number of iterations over the 1000 experiments? Choose the closest answer; provide your code.
[a] 600
[b] 1200
[c] 1800
[d] 2400
[e] 3000
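A sketch of one run of the experiment, assuming X, y, and w_lin from the Problem 14 sketch; the stochastic gradient of the squared error on example n is −2(y_n − w^T x_n) x_n:

import numpy as np

def sgd_linear(X, y, w_lin, eta=0.001, seed=0):
    # run SGD from w = 0 until E_in^sqr(w) <= 1.01 E_in^sqr(w_lin); return #iterations
    rng = np.random.default_rng(seed)
    target = 1.01 * np.mean((X @ w_lin - y) ** 2)
    w, t = np.zeros(X.shape[1]), 0
    while np.mean((X @ w - y) ** 2) > target:
        n = rng.integers(len(y))
        w = w + eta * 2 * (y[n] - w @ X[n]) * X[n]       # one stochastic gradient step
        t += 1
    return t

# average over 1000 seeds:
# print(np.mean([sgd_linear(X, y, w_lin, seed=s) for s in range(1000)]))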
16. (*) Add x_{n,0} = 1 to each x_n. Then, implement the SGD algorithm for logistic regression by replacing the SGD update step in the previous problem with the one on page 10 of Lecture 11. Pick one example uniformly at random in each iteration, take η = 0.001, and initialize w with w_0 = 0. Run the algorithm for 500 iterations. Repeat the experiment 1000 times, each with a different random seed. What is the average E_in^ce(w_t) over the 1000 experiments, with w_t being the weight vector after the 500 iterations, where E_in^ce denotes the averaged cross-entropy error over the N examples? Choose the closest answer; provide your code.
[a] 0.44
[b] 0.50
[c] 0.56
[d] 0.62
[e] 0.68
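A sketch of the logistic-regression SGD and the cross-entropy error, again assuming X and y from the Problem 14 sketch; the lecture's update on example n is w ← w + η θ(−y_n w^T x_n) y_n x_n with θ(s) = 1/(1 + e^{−s}):

import numpy as np

def sgd_logistic(X, y, w0, eta=0.001, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(iters):
        n = rng.integers(len(y))
        w = w + eta * (1.0 / (1.0 + np.exp(y[n] * (w @ X[n])))) * y[n] * X[n]
    return w

def Ein_ce(w, X, y):
    return np.mean(np.log(1.0 + np.exp(-y * (X @ w))))   # averaged cross-entropy error

# average over 1000 seeds (Problem 17 only changes np.zeros(...) to w_lin):
# print(np.mean([Ein_ce(sgd_logistic(X, y, np.zeros(X.shape[1]), seed=s), X, y)
#                for s in range(1000)]))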
17. (*) Repeat the previous problem, but with w initialized by w_0 = w_lin of Problem 14 instead. Repeat the experiment 1000 times, each with a different random seed. What is the average E_in^ce(w_t) over the 1000 experiments? Choose the closest answer; provide your code.
[a] 0.44
[b] 0.50
[c] 0.56
[d] 0.62
[e] 0.68
18. (*) Following Problem 14, what is |E_in^0/1(w_lin) − E_out^0/1(w_lin)|, where the superscript 0/1 denotes the 0/1 error (i.e., using sign(w_lin^T x) for binary classification), and E_out^0/1 is estimated using the test set provided above? Choose the closest answer; provide your code.
[a] 0.32
[b] 0.36
[c] 0.40
[d] 0.44
[e] 0.48
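A sketch of the 0/1-error comparison, assuming w_lin, X, y from the Problem 14 sketch and X_test, y_test loaded the same way from hw3_test.dat (including the added x_0 = 1):

import numpy as np

def E01(w, X, y):
    return np.mean(np.sign(X @ w) != y)                  # 0/1 error of sign(w^T x)

# print(abs(E01(w_lin, X, y) - E01(w_lin, X_test, y_test)))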
19. (*) Next, consider the following homogeneous order-Q polynomial transform
Φ(x) = (1, x_1, x_2, ..., x_10, x_1^2, x_2^2, ..., x_10^2, ..., x_1^Q, x_2^Q, ..., x_10^Q).
Transform the training and testing data according to Φ(x) with Q = 3, and again implement the linear regression algorithm on page 11 of Lecture 9. What is |E_in^0/1(g) − E_out^0/1(g)|, where g is the hypothesis returned by the transform + linear regression procedure? Choose the closest answer; provide your code.
[a] 0.32
[b] 0.36
[c] 0.40
[d] 0.44
[e] 0.48
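A sketch of the homogeneous polynomial transform, assuming X_raw / X_test_raw are the untransformed 10-dimensional inputs and E01 is the helper from the Problem 18 sketch:

import numpy as np

def poly_transform(X_raw, Q):
    # (1, x_1..x_10, x_1^2..x_10^2, ..., x_1^Q..x_10^Q)
    return np.column_stack([np.ones(len(X_raw))] + [X_raw ** q for q in range(1, Q + 1)])

# Z, Z_test = poly_transform(X_raw, 3), poly_transform(X_test_raw, 3)   # Q = 10 for Problem 20
# w = np.linalg.pinv(Z) @ y
# print(abs(E01(w, Z, y) - E01(w, Z_test, y_test)))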
20. (*) Repeat the previous problem, but with Q = 10 instead. What is |E_in^0/1(g) − E_out^0/1(g)|? Choose the closest answer; provide your code.
[a] 0.32
[b] 0.36
[c] 0.40
[d] 0.44
[e] 0.48