Linear Regression
1. Consider a noisy target y = wfTx + ε, where x ∈ Rd+1 (including the added coordinate x0 = 1), y ∈ R, wf ∈ Rd+1 is an unknown vector, and ε is an i.i.d. noise term with zero mean and variance σ2. Assume that we run linear regression on a training data set D = {(x1,y1),...,(xN,yN)} generated i.i.d. from some P(x) and the noise process above, and obtain the weight vector wlin. As briefly discussed in Lecture 5, it can be shown that the expected in-sample error Ein(wlin) with respect to D is given by
ED[Ein(wlin)] = σ2 (1 − (d+1)/N).
For σ = 0.1 and d = 19, what is the smallest number of examples N such that ED [Ein(wlin)] is no less than 0.005? Choose the correct answer; explain your answer.
[a] 25
[b] 30
[c] 35
[d] 40
[e] 45
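For a quick numerical sanity check of Problem 1, one can simply evaluate the expression σ2 (1 − (d+1)/N) given above at the candidate values of N; a minimal sketch in Python:

```python
# Evaluate sigma^2 * (1 - (d+1)/N) for the candidate N of Problem 1.
sigma, d = 0.1, 19
for N in [25, 30, 35, 40, 45]:
    print(N, sigma ** 2 * (1 - (d + 1) / N))
```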
2. Consider the target function f(x) = x2. Sample x uniformly from [0,1], and use all linear hypotheses h(x) = w0 + w1 · x to approximate the target function with respect to the squared error. What are the weights (w0, w1) of the optimal hypothesis? Choose the correct answer; explain your answer. (A numerical check is sketched after the hint below.)
[a] (0, 1)
(Hint: The optimal hypothesis g∗ must reach the minimum Eout(g∗).)
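One way to sanity-check the derivation for Problem 2 is to approximate Eout(h) = E_x[(w0 + w1 x − x2)2] on a dense grid over [0,1] and solve the resulting least-squares problem numerically; a minimal sketch, assuming numpy:

```python
import numpy as np

# Approximate Eout(w0, w1) = E_x[(w0 + w1*x - x^2)^2] with a dense uniform grid on [0, 1],
# then solve the resulting least-squares problem for (w0, w1).
x = np.linspace(0.0, 1.0, 100001)
A = np.stack([np.ones_like(x), x], axis=1)   # design matrix with columns (1, x)
w0, w1 = np.linalg.lstsq(A, x ** 2, rcond=None)[0]
print(w0, w1)                                # numerical estimate of the optimal weights
```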
3. Following the previous problem, assume that we sample two examples x1 and x2 uniformly from [0,1] to form the training set D = {(x1,f(x1)),(x2,f(x2))}, and use linear regression to get g for approximating the target function with respect to the squared error. You can neglect the degenerate cases where x1 and x2 are the same. What is ED[|Ein(g) − Eout(g)|]? Choose the correct answer; explain your answer. (A Monte Carlo check is sketched after the choices.)
[a]
[b]
[c]
[d]
[e]
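Problem 3 expects an analytic derivation, but the expectation can also be estimated by simulation to verify the result; a Monte Carlo sketch, assuming numpy (the inner grid only approximates the integral defining Eout):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(0.0, 1.0, 1001)          # grid for approximating Eout as an average over [0, 1]
gaps = []
for _ in range(20000):
    x1, x2 = rng.uniform(0.0, 1.0, size=2)
    if np.isclose(x1, x2):                # neglect the degenerate case x1 == x2
        continue
    # With only two points, linear regression returns the interpolating line through
    # (x1, x1^2) and (x2, x2^2), so Ein(g) = 0.
    w1 = (x2 ** 2 - x1 ** 2) / (x2 - x1)
    w0 = x1 ** 2 - w1 * x1
    eout = np.mean((w0 + w1 * xs - xs ** 2) ** 2)
    gaps.append(abs(0.0 - eout))
print(np.mean(gaps))                      # Monte Carlo estimate of E_D[|Ein(g) - Eout(g)|]
```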
Cross-Entropy Error
4. In class, we introduced our version of the cross-entropy error function
Ein(w) = (1/N) Σ_{n=1}^{N} ln(1 + exp(−yn wTxn))
based on the definition of yn ∈ {−1,+1}. If we transform each yn ∈ {−1,+1} to y′n = (1 + yn)/2 ∈ {0,1}, which of the following error functions is equivalent to Ein above? Choose the correct answer; explain your answer.
[e] none of the other choices
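Whichever candidate one settles on for Problem 4 can be checked numerically: with the relabeling y′ = (1 + y)/2 assumed above, the {−1,+1} cross-entropy and the {0,1} cross-entropy should agree for every (w, x, y). A small sketch, assuming numpy:

```python
import numpy as np

def theta(s):
    return 1.0 / (1.0 + np.exp(-s))        # logistic function

rng = np.random.default_rng(1)
for _ in range(5):
    w, x = rng.normal(size=3), rng.normal(size=3)
    y = rng.choice([-1, 1])
    s = w @ x
    err_pm = np.log(1.0 + np.exp(-y * s))                               # y in {-1, +1} version
    y01 = (1 + y) / 2                                                   # relabeled y' in {0, 1}
    err_01 = -(y01 * np.log(theta(s)) + (1 - y01) * np.log(1 - theta(s)))
    print(np.isclose(err_pm, err_01))
```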
5. Consider a coin with an unknown head probability µ. Independently flip this coin N times to get y1,y2,...,yN, where yn = 1 if the n-th flip results in a head, and 0 otherwise. Define
ν = (1/N) Σ_{n=1}^{N} yn.
How many of the following statements about ν are true? Choose the correct answer; explain your answer by briefly illustrating why those statements are true.
• With probability more than 1 − δ, |ν − µ| ≤ √((1/(2N)) ln(2/δ)) for all N ∈ N and 0 < δ < 1.
• ν maximizes likelihood(µ̂) over all µ̂ ∈ [0,1].
• ν minimizes the squared error
(1/N) Σ_{n=1}^{N} (ŷ − yn)2
over all ŷ ∈ R.
• When 0 < ν < 1, ν also minimizes the cross-entropy error (which is similar to the cross-entropy error for logistic regression)
−(1/N) Σ_{n=1}^{N} (yn ln ŷ + (1 − yn) ln(1 − ŷ))
over all ŷ ∈ (0,1).
(Note: µ plays a role similar to the “target function” and µ̂ plays a role similar to the “hypothesis” in our machine learning framework.)
[a] 0
[b] 1
[c] 2
[d] 3
[e] 4
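The likelihood, squared-error, and cross-entropy statements of Problem 5 can each be checked with a simple grid search over the candidate µ̂ or ŷ; a minimal sketch, assuming numpy and a simulated coin:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50
y = rng.binomial(1, 0.3, size=N)           # simulated flips of a coin with some head probability
nu = y.mean()

grid = np.linspace(1e-6, 1 - 1e-6, 9999)   # candidate values of mu-hat / y-hat
log_lik = y.sum() * np.log(grid) + (N - y.sum()) * np.log(1 - grid)
sq_err = np.mean((y[:, None] - grid[None, :]) ** 2, axis=0)
ce_err = -np.mean(y[:, None] * np.log(grid[None, :])
                  + (1 - y[:, None]) * np.log(1 - grid[None, :]), axis=0)

# Up to grid resolution, each argmax/argmin below should land on nu itself.
print(nu, grid[np.argmax(log_lik)], grid[np.argmin(sq_err)], grid[np.argmin(ce_err)])
```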
Stochastic Gradient Descent
6. In the perceptron learning algorithm, we find one example (xn(t),yn(t)) that the current weight vector wt mis-classifies, and then update wt by
wt+1 ← wt + yn(t)xn(t).
The algorithm can be viewed as optimizing some Ein(w) that is composed of one of the following point-wise error functions with stochastic gradient descent (neglecting any non-differentiable points of the error function). What is the error function? Choose the correct answer; explain your answer.
[a] err(w,x,y) = max(0,−ywTx)
[b] err(w,x,y) = −max(0,−ywTx)
[c] err(w,x,y) = max(ywTx,−ywTx)
[d] err(w,x,y) = −max(ywTx,−ywTx)
[e] none of the other choices
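To relate Problem 6 to the PLA update, recall that SGD moves along the negative gradient of the point-wise error, so for a misclassified example the negative gradient of the correct candidate should equal y x. A finite-difference sketch for candidate [a], assuming numpy (the other candidates can be checked the same way):

```python
import numpy as np

def err_a(w, x, y):
    # Candidate [a]: err(w, x, y) = max(0, -y * w^T x).
    return max(0.0, -y * (w @ x))

rng = np.random.default_rng(3)
w, x, y = rng.normal(size=3), rng.normal(size=3), 1
if y * (w @ x) > 0:                         # force a misclassified example, as used by PLA
    w = -w

eps = 1e-6
grad = np.array([(err_a(w + eps * e, x, y) - err_a(w - eps * e, x, y)) / (2 * eps)
                 for e in np.eye(3)])

# SGD moves along -grad; for a misclassified example this should match the PLA update y * x.
print(-grad)
print(y * x)
```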
Multinomial Logistic Regression
7. In Lecture 6, we solve multiclass classification by OVA or OVO decompositions. One alternative to deal with multiclass classification is to extend the original logistic regression model to Multinomial Logistic Regression (MLR). For a K-class classification problem, we will denote the output space
Y = {1,2,··· ,K}. The hypotheses considered by MLR can be indexed by a matrix
W = [w1 w2 ··· wK]
that contains K weight vectors (w1,··· ,wK), each of length d+1. The matrix represents a hypothesis
hk(x) = exp(wkTx) / Σ_{j=1}^{K} exp(wjTx)
that can be used to approximate the target distribution P(y|x) for any (x,y). MLR then seeks the maximum-likelihood solution over all such hypotheses. For a given data set {(x1,y1),...,(xN,yN)} generated i.i.d. from some P(x) and target distribution P(y|x), the likelihood of h is proportional to ∏_{n=1}^{N} hyn(xn). That is, minimizing the negative log-likelihood is equivalent to minimizing an Ein(W) that is composed of the following error function
err(W,x,y) = −ln hy(x) = −Σ_{k=1}^{K} ⟦y = k⟧ ln hk(x).
When minimizing Ein(W) with SGD, we update the W(t) at the t-th iteration to W(t+1) by
W(t+1) ← W(t) + η · V,
where V is a (d + 1) × K matrix whose k-th column is an update direction for the k-th weight vector. Assume that an example (xn,yn) is used for the SGD update above. What is the yn-th column of V? Choose the correct answer; explain your answer.
[a] (1 − hyn(xn))xn
[b] (hyn(xn) − 1)xn
[c] (−hyn(xn))xn
[d] (hyn(xn))xn
[e] none of the other choices
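A finite-difference check can confirm the yn-th column of V in Problem 7: compute the numerical gradient of err(W, x, y) = −ln hy(x) with respect to the y-th column of W and compare −gradient against the candidates. A sketch, assuming numpy and the softmax hypothesis written above:

```python
import numpy as np

def h(W, x):
    # Softmax hypothesis: h_k(x) = exp(w_k^T x) / sum_j exp(w_j^T x).
    s = W.T @ x
    e = np.exp(s - s.max())
    return e / e.sum()

def err(W, x, y):
    return -np.log(h(W, x)[y])              # err(W, x, y) = -ln h_y(x)

rng = np.random.default_rng(4)
d, K = 3, 4
W = rng.normal(size=(d + 1, K))
x = np.concatenate(([1.0], rng.normal(size=d)))   # input with the added x0 = 1
y = 2                                             # index of the observed class (0-based here)

# Finite-difference gradient of err with respect to the y-th column of W.
eps = 1e-6
grad_y = np.zeros(d + 1)
for i in range(d + 1):
    Wp, Wm = W.copy(), W.copy()
    Wp[i, y] += eps
    Wm[i, y] -= eps
    grad_y[i] = (err(Wp, x, y) - err(Wm, x, y)) / (2 * eps)

# The SGD update direction is V = -gradient; compare its y-th column with, e.g., candidate [a].
print(-grad_y)
print((1 - h(W, x)[y]) * x)
```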
Nonlinear Transformation
8. Given the following training data set:
x1 = (0, 1), y1 = −1
x2 = (0, −1), y2 = −1
x3 = (−1, 0), y3 = +1
x4 = (1, 0), y4 = +1
Use the quadratic transform Φ2(x) = (1, x1, x2, x1^2, x1x2, x2^2), and take sign(0) = 1. Which of the following weight vectors w̃ represents a linear classifier in the Z-space that can separate all the transformed examples perfectly? Choose the correct answer; explain your answer.
[a] (0,−1,0,0,0,0)
[b] (0,0,−1,0,0,0)
[c] (0,0,0,−1,0,0)
[d] (0,0,0,0,−1,0)
[e] (0,0,0,0,0,−1)
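Because the Z-space here has only 6 dimensions and there are only 4 examples, the candidates of Problem 8 can be checked exhaustively; a short sketch, assuming numpy and the component ordering of Φ2 written above:

```python
import numpy as np

def phi2(x1, x2):
    # Quadratic transform (1, x1, x2, x1^2, x1*x2, x2^2).
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

X = [(0, 1), (0, -1), (-1, 0), (1, 0)]
y = np.array([-1, -1, +1, +1])
Z = np.array([phi2(*x) for x in X])

candidates = {
    "a": [0, -1, 0, 0, 0, 0], "b": [0, 0, -1, 0, 0, 0], "c": [0, 0, 0, -1, 0, 0],
    "d": [0, 0, 0, 0, -1, 0], "e": [0, 0, 0, 0, 0, -1],
}
for name, w in candidates.items():
    scores = Z @ np.array(w, dtype=float)
    pred = np.where(scores >= 0, 1, -1)     # take sign(0) = +1 as stated in the problem
    print(name, np.array_equal(pred, y))    # True iff this candidate separates all examples
```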
9. Consider a feature transform Φ(x) = Γx, where Γ is a (d + 1) by (d + 1) invertible matrix. For a training data set {(x1,y1),...,(xN,yN)}, run linear regression on the original data set and get wlin. Then, run linear regression on the Φ-transformed data and get w̃. For simplicity, assume that the matrix X (with every row being xnT) satisfies that XTX is invertible. What is the relationship between wlin and w̃? Choose the correct answer; explain your answer.
[a] wlin = Γw˜
[b] wlin = ΓTw˜
[c] wlin = (Γ−1)Tw˜
[d] wlin = Γ−1w˜
[e] none of the other choices
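The relationship in Problem 9 follows from plugging the transformed data matrix Z = XΓT into the normal equations, and it can also be verified numerically with a random invertible Γ; a minimal sketch, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 50, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # rows are x_n^T (with x0 = 1)
y = rng.normal(size=N)
Gamma = rng.normal(size=(d + 1, d + 1))                      # invertible with probability 1

w_lin = np.linalg.lstsq(X, y, rcond=None)[0]                 # regression on the original data
Z = X @ Gamma.T                                              # rows are (Gamma x_n)^T
w_tilde = np.linalg.lstsq(Z, y, rcond=None)[0]               # regression on the transformed data

# Compare w_lin against the candidate relationships, e.g. Gamma^T w_tilde here.
print(np.allclose(w_lin, Gamma.T @ w_tilde))
```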
10. After “visualizing” the data and noticing that all x1, x2, ..., xN are distinct, Dr. Trans magically decides the following transform
Φ(x) = (⟦x = x1⟧, ⟦x = x2⟧, ..., ⟦x = xN⟧).
That is, Φ(x) is an N-dimensional vector whose n-th component is 1 if and only if x = xn. If we run linear regression after applying this transform, what is the optimal w̃? Choose the correct answer; explain your answer.
[a] 1, the vector of all 1s.
[b] 0, the vector of all 0s.
[c] y
[d] −y
[e] none of the other choices
(Note: Be sure to also check what Ein(w˜) is!)
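A quick way to see what happens in Problem 10: with distinct xn, the transformed data matrix has Φ(xn) as its n-th row, which is the n-th standard basis vector, so the matrix is the N × N identity. A tiny sketch, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 8
y = rng.choice([-1.0, 1.0], size=N)

# The transformed data matrix: row n is Phi(x_n), i.e. the n-th standard basis vector.
Z = np.eye(N)
w_tilde = np.linalg.lstsq(Z, y, rcond=None)[0]

print(np.allclose(w_tilde, y))             # the optimal weights equal y
print(np.mean((Z @ w_tilde - y) ** 2))     # and the corresponding Ein(w_tilde) is 0
```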
11. Assume that we couple linear regression with one-versus-all decomposition for multi-class classification, and get K weight vectors w1, ..., wK. Assume that the squared error for the k-th binary classification problem is ek. What is the tightest upper bound of the 0/1 error of the multiclass classifier g formed by the one-versus-all decomposition? Choose the correct answer; explain your answer.
Experiments with Linear and Nonlinear Models
Next, we will play with transform + linear regression for binary classification. Please use the following set for training:
https://www.csie.ntu.edu.tw/~htlin/course/ml21fall/hw3/hw3_train.dat
and the following set for testing (estimating Eout):
https://www.csie.ntu.edu.tw/~htlin/course/ml21fall/hw3/hw3_test.dat
Each line of the data set contains one (xn,yn) with xn ∈ R10. The first 10 numbers of the line contain the components of xn in order; the last number is yn, which belongs to {−1,+1} ⊆ R. That is, we can use those yn for either binary classification or regression.
12. (*) Consider the following homogeneous order-Q polynomial transform
Φ(x) = (1, x1, x2, ..., x10, x1^2, x2^2, ..., x10^2, ..., x1^Q, x2^Q, ..., x10^Q).
Transform the training and testing data according to Φ(x) with Q = 2, and implement the linear regression algorithm on the transformed data. What is , where g is the hypothesis returned by the transform + linear regression procedure? Choose the closest answer; provide your code. (A starting-point sketch follows the choices.)
[a] 0.28
[b] 0.32
[c] 0.36
[d] 0.40
[e] 0.44
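A starting-point sketch for Problems 12 and 13, assuming numpy, that hw3_train.dat and hw3_test.dat have been downloaded to the working directory, and the homogeneous order-Q transform as written above; it reports the 0/1 in-sample and out-of-sample errors (and their gap) so the requested quantity can be read off:

```python
import numpy as np

def load(path):
    data = np.loadtxt(path)
    return data[:, :-1], data[:, -1]            # first 10 columns are x, last column is y

def poly_transform(X, Q):
    # Homogeneous order-Q transform: (1, x1..x10, x1^2..x10^2, ..., x1^Q..x10^Q).
    feats = [np.ones((X.shape[0], 1))] + [X ** q for q in range(1, Q + 1)]
    return np.hstack(feats)

def zero_one_error(Z, y, w):
    pred = np.where(Z @ w >= 0, 1.0, -1.0)      # take sign(0) = +1
    return np.mean(pred != y)

X_train, y_train = load("hw3_train.dat")
X_test, y_test = load("hw3_test.dat")

for Q in (2, 8):                                # Problem 12 uses Q = 2, Problem 13 uses Q = 8
    Z_train, Z_test = poly_transform(X_train, Q), poly_transform(X_test, Q)
    w = np.linalg.lstsq(Z_train, y_train, rcond=None)[0]
    e_in = zero_one_error(Z_train, y_train, w)
    e_out = zero_one_error(Z_test, y_test, w)
    print(Q, e_in, e_out, abs(e_in - e_out))
```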
13. (*) Repeat the previous problem, but with Q = 8 instead. What is , where g is the hypothesis returned by the transform + linear regression procedure? Choose the closest answer; provide your code.
[a] 0.30
[b] 0.35
[c] 0.40
[d] 0.45
[e] 0.50
14. (*) Repeat the previous problem, but with Φ2 (the full order-2 polynomial transform introduced in the lecture, which is of 1+10+45+10 dimensions) instead. What is , where g is the hypothesis returned by the transform + linear regression procedure? Choose the closest answer; provide your code.
[a] 0.33
[b] 0.41
[c] 0.49
[d] 0.57
[e] 0.65
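For Problem 14 only the transform changes; a sketch of the full order-2 transform (1 constant + 10 linear + 45 cross + 10 square terms = 66 dimensions), assuming numpy, which can replace poly_transform in the Problem 12 sketch:

```python
import numpy as np
from itertools import combinations

def full_order2_transform(X):
    # Full order-2 polynomial transform: 1, x_i, x_i*x_j (i < j), x_i^2  ->  66 dimensions.
    N, d = X.shape
    cross = np.column_stack([X[:, i] * X[:, j] for i, j in combinations(range(d), 2)])
    return np.hstack([np.ones((N, 1)), X, cross, X ** 2])
```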
15. (*) Instead of transforming to a higher dimensional space, we can also transform to a lower dimensional space. Consider the following 10 transforms:
Φ(1)(x) = (x0, x1)
Φ(2)(x) = (x0, x1, x2)
...
Φ(10)(x) = (x0, x1, x2, ..., x10)
Run Φ(i) + linear regression to get a hypothesis gi. What is the minimum over i? Choose the closest answer; provide your code.
[a] 1
[b] 2
[c] 3
[d] 5
[e] 8
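For Problem 15, each Φ(i) simply keeps x0 and the first i raw coordinates. A minimal sketch, assuming numpy and locally downloaded data files, that reports the 0/1 errors for every i so the minimizing quantity can be picked out:

```python
import numpy as np

train, test = np.loadtxt("hw3_train.dat"), np.loadtxt("hw3_test.dat")
X_train, y_train = train[:, :-1], train[:, -1]
X_test, y_test = test[:, :-1], test[:, -1]

def with_bias(X):
    return np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1

for i in range(1, 11):                                 # Phi^(i) keeps (x0, x1, ..., xi)
    Z_train, Z_test = with_bias(X_train[:, :i]), with_bias(X_test[:, :i])
    w = np.linalg.lstsq(Z_train, y_train, rcond=None)[0]
    e_in = np.mean(np.where(Z_train @ w >= 0, 1.0, -1.0) != y_train)
    e_out = np.mean(np.where(Z_test @ w >= 0, 1.0, -1.0) != y_test)
    print(i, e_in, e_out, abs(e_in - e_out))
```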
16. (*) Consider a transform that randomly chooses 5 out of 10 dimensions. That is, Φ(x) =
(x0,xi1,xi2,xi3,xi4,xi5), where i1 to i5 are distinct random integers uniformly and independently generated within {1,2,...,10}. Run Φ + linear regression to get a hypothesis g. What is the average over 200 experiments, each generating Φ with a different random seed? Choose the closest answer; provide your code.
[a] 0.06
[b] 0.11
[c] 0.16
[d] 0.21
[e] 0.26
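A sketch for Problem 16, assuming numpy, locally downloaded data files, and that "distinct ... uniformly and independently generated" is implemented as sampling 5 indices out of {1,...,10} without replacement; it averages the 0/1 errors (and their gap) over 200 random seeds:

```python
import numpy as np

train, test = np.loadtxt("hw3_train.dat"), np.loadtxt("hw3_test.dat")
X_train, y_train = train[:, :-1], train[:, -1]
X_test, y_test = test[:, :-1], test[:, -1]

results = []
for seed in range(200):                                    # one random transform per seed
    rng = np.random.default_rng(seed)
    idx = rng.choice(10, size=5, replace=False)            # 5 distinct dimensions out of 10
    Z_train = np.hstack([np.ones((X_train.shape[0], 1)), X_train[:, idx]])
    Z_test = np.hstack([np.ones((X_test.shape[0], 1)), X_test[:, idx]])
    w = np.linalg.lstsq(Z_train, y_train, rcond=None)[0]
    e_in = np.mean(np.where(Z_train @ w >= 0, 1.0, -1.0) != y_train)
    e_out = np.mean(np.where(Z_test @ w >= 0, 1.0, -1.0) != y_test)
    results.append((e_in, e_out, abs(e_in - e_out)))

print(np.mean(results, axis=0))                            # averages over the 200 experiments
```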