Deterministic Noise
1. (Lecture 13) Consider the target function f(x) = e^x. When x is uniformly sampled from [0, 2], and we use all linear hypotheses h(x) = w · x to approximate the target function with respect to the squared error, what is the magnitude of deterministic noise for each x? Choose the correct answer; explain your answer.
[a] |e^x|
(Hint: If you want to use page 17 of Lecture 13 for inspiration, please note that the answer on page 17 is not exact. Here, however, we are asking you for an exact answer.)
Learning Curve
2. (Lecture 13) Learning curves are important for us to understand the behavior of learning algorithms. The learning curves that we have plotted in lecture 13 come from polynomial regression with squared error, and we see that the expected Ein curve is always below the expected Eout curve. Next, we think about whether this behavior is also true in general. Consider the 0/1 error, an arbitrary non-empty hypothesis set H, and a learning algorithm A that returns one h ∈ H with the minimum Ein on any non-empty data set D. That is,
$$A(\mathcal{D}) = \operatorname*{argmin}_{h \in \mathcal{H}} E_{\text{in}}(h).$$
Assume that each example in D is generated i.i.d. from a distribution P, and define Eout(h) with respect to the distribution. How many of the following statements are always false?
• E_D[E_in(A(D))] < E_D[E_out(A(D))]
• E_D[E_in(A(D))] = E_D[E_out(A(D))]
• E_D[E_in(A(D))] > E_D[E_out(A(D))]
Choose the correct answer; explain your answer.
[a] 0
[b] 1
[c] 2
[d] 3
[e] 1126 (seriously?)
(Hint: Think about the optimal hypothesis h* = argmin_{h∈H} E_out(h).)
Noisy Virtual Examples
3. (Lecture 13) On page 20 of Lecture 13, we discussed adding “virtual examples” (hints) to help combat overfitting. One way of generating virtual examples is to add a small noise to the input vector x ∈ R^{d+1} (including the 0-th component x_0). For each (x_1, y_1), (x_2, y_2), ..., (x_N, y_N) in our training data set, assume that we generate virtual examples (x̃_1, y_1), (x̃_2, y_2), ..., (x̃_N, y_N), where x̃_n is simply x_n + ε_n and the noise vector ε_n ∈ R^{d+1} is generated i.i.d. from a multivariate normal distribution N(0, σ²·I_{d+1}). Here 0 ∈ R^{d+1} denotes the all-zero vector and I_{d+1} is an identity matrix of size d + 1.
Recall that when training the linear regression model, we need to calculate X^T X first. Define the hinted input matrix
$$X_h = \begin{bmatrix} X \\ \tilde{X} \end{bmatrix}, \qquad \tilde{X} = \begin{bmatrix} \tilde{x}_1 & \tilde{x}_2 & \cdots & \tilde{x}_N \end{bmatrix}^T,$$
that is, the original examples stacked on top of the virtual ones. What is the expected value E(X_h^T X_h), where the expectation is taken over the (Gaussian) noise-generating process above? Choose the correct answer; explain your answer.
[a] X^T X + σ² I_{d+1}
[b] X^T X + 2σ² I_{d+1}
[c] 2X^T X + σ² I_{d+1}
[d] 2X^T X + Nσ² I_{d+1}
[e] 2X^T X + 2Nσ² I_{d+1}
(Note: The choices here “hint” you that the expected value is related to the matrix being inverted for regularized linear regression (see page 10 of Lecture 14). That is, data hinting “by noise” is closely related to regularization. If x contains the pixels of an image, the virtual example is a Gaussian-noise-contaminated image with the same label, e.g. https://en.wikipedia.org/wiki/Gaussian_noise. Adding such noise is a very common technique to generate virtual examples for images.)
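For concreteness, here is a minimal sketch (not part of the original problem) of how such Gaussian-noise virtual examples could be generated with NumPy; the function name, the σ value, and the toy data are our own illustrative assumptions.

```python
import numpy as np

def make_noisy_hints(X, y, sigma=0.1, rng=None):
    """Generate one Gaussian-noise virtual example per training example.

    X: (N, d+1) matrix whose rows are the input vectors x_n (x_0 included).
    y: (N,) label vector; the virtual examples keep the original labels.
    sigma: standard deviation of the i.i.d. noise added to every component.
    """
    rng = np.random.default_rng(rng)
    noise = rng.normal(loc=0.0, scale=sigma, size=X.shape)  # epsilon_n ~ N(0, sigma^2 I)
    X_virtual = X + noise                                   # x_tilde_n = x_n + epsilon_n
    # The "hinted" data set stacks the original and the virtual examples.
    X_h = np.vstack([X, X_virtual])
    y_h = np.concatenate([y, y])
    return X_h, y_h

# Illustrative usage with random toy data (d = 2, N = 5):
X = np.hstack([np.ones((5, 1)), np.random.default_rng(0).standard_normal((5, 2))])
y = np.sign(np.random.default_rng(1).standard_normal(5))
X_h, y_h = make_noisy_hints(X, y, sigma=0.1, rng=2)
```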
4. (Lecture 13) Following the previous problem, when training the linear regression model, we also need to calculate X^T y. Define the hinted label vector
$$y_h = \begin{bmatrix} y \\ y \end{bmatrix},$$
that is, the original label vector stacked on top of itself (the virtual examples keep the original labels). What is the expected value E(X_h^T y_h), where the expectation is taken over the (Gaussian) noise-generating process above? Choose the correct answer; explain your answer.
[a] 2N X^T y
[b] N X^T y
[c] 0
[d] X^T y
[e] 2 X^T y
Regularization
5. (Lecture 14) Consider the matrix of input vectors as X (as defined in Lecture 9), and assume X^T X to be invertible. That is, X^T X must be symmetric positive definite and can be decomposed to QΓQ^T, where Q is an orthogonal matrix (Q^T Q = QQ^T = I_{d+1}) and Γ is a diagonal matrix that contains the eigenvalues γ_0, γ_1, ..., γ_d of X^T X. Note that the eigenvalues must be positive.
Now, consider a feature transform Φ(x) = Q^T x. The feature transform “rotates” the original x. After transforming each x_n to z_n = Φ(x_n), denote the new matrix of transformed input vectors as Z. That is, Z = XQ. Then, apply regularized linear regression in the Z-space (see Lecture 12). That is, solve
$$\min_{w \in \mathbb{R}^{d+1}}\ \frac{1}{N}\,\|Zw - y\|^2 + \frac{\lambda}{N}\,\|w\|^2.$$
Denote the optimal solution when λ = 0 as v (i.e., w_lin), and the optimal solution when λ > 0 as u (i.e., w_reg). What is the ratio u_i/v_i? Choose the correct answer; explain your answer.
(Note: All the choices are of value < 1 if λ > 0. This is the behavior of weight “decay”: w_reg is shorter than w_lin. That is why the L2-regularizer is also called the weight-decay regularizer.)
6. (Lecture 14) Consider a one-dimensional data set where each xn ∈ R and yn ∈ R.
Then, solve the following one-variable regularized linear regression problem:
$$\min_{w \in \mathbb{R}}\ \frac{1}{N}\sum_{n=1}^{N}\big(y_n - w\,x_n\big)^2 + \frac{\lambda}{N}\,w^2.$$
If the optimal solution to the problem above is w∗, it can be shown that w∗ is also the optimal solution of
$$\min_{w \in \mathbb{R}}\ \frac{1}{N}\sum_{n=1}^{N}\big(y_n - w\,x_n\big)^2 \qquad \text{subject to } w^2 \le C$$
with C = (w*)². This allows us to express the relationship between C in the constrained optimization problem and λ in the augmented optimization problem for any λ > 0. What is the relationship? Choose the correct answer; explain your answer.
(Note: All the choices hint you that a smaller λ corresponds to a bigger C.)
7. (Lecture 14) Additive smoothing (https://en.wikipedia.org/wiki/Additive_smoothing) is a simple yet useful technique for estimating discrete probabilities. Consider the technique for estimating the head probability of a coin. Let y_1, y_2, ..., y_N denote the flip results from a coin, with y_n = 1 meaning a head and y_n = 0 meaning a tail. Additive smoothing adds 2K “virtual flips”, with K of them being head and the other K being tail. Then, the head probability is estimated by
$$\hat{y} = \frac{\sum_{n=1}^{N} y_n + K}{N + 2K}$$
(a small computational sketch follows the choices below).
The estimate can be viewed as the optimal solution of
$$\min_{y}\ \frac{1}{N}\left(\sum_{n=1}^{N}\big(y - y_n\big)^2 + 2K\,\Omega(y)\right),$$
where Ω(y) is a “regularizer” to this estimation problem. What is Ω(y)? Choose the correct answer; explain your answer.
[a] (y + 1)²
[b] (y + 0.5)²
[c] y²
[d] (y − 0.5)²
[e] (y − 1)²
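As referenced above, here is a minimal sketch of the additive-smoothing estimate itself; the function name and the example flip counts are illustrative assumptions, not part of the problem.

```python
def smoothed_head_probability(flips, K):
    """Additive smoothing: add K virtual heads and K virtual tails.

    flips: sequence of 0/1 outcomes (1 = head, 0 = tail).
    K: number of virtual heads (and virtual tails) to add.
    """
    N = len(flips)
    return (sum(flips) + K) / (N + 2 * K)

# 3 heads out of 10 real flips with K = 2: (3 + 2) / (10 + 4) = 5/14.
print(smoothed_head_probability([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], K=2))
```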
8. (Lecture 14) On page 12 of Lecture 14, we mentioned that the ranges of features may affect regularization. One common technique to align the ranges of features is to consider a “scaling” transformation. Define Φ(x) = Γ^{-1} x, where Γ is a diagonal matrix with positive diagonal values γ_0, γ_1, ..., γ_d. Then, conducting L2-regularized linear regression in the Z-space,
$$\min_{\tilde{w} \in \mathbb{R}^{d+1}}\ \frac{1}{N}\sum_{n=1}^{N}\big(y_n - \tilde{w}^T \Phi(x_n)\big)^2 + \frac{\lambda}{N}\,\tilde{w}^T \tilde{w},$$
is equivalent to regularized linear regression in the X-space,
$$\min_{w \in \mathbb{R}^{d+1}}\ \frac{1}{N}\sum_{n=1}^{N}\big(y_n - w^T x_n\big)^2 + \frac{\lambda}{N}\,\Omega(w),$$
with a different regularizer Ω(w). What is Ω(w)? Choose the correct answer; explain your answer.
[a] w^T Γ w
[b] w^T Γ² w
[c] w^T w
[d] w^T Γ^{-2} w
[e] w^T Γ^{-1} w
9. (Lecture 13/14) In the previous problem, regardless of which regularizer you choose, the optimization problem is of the form
$$\min_{w \in \mathbb{R}^{d+1}}\ \frac{1}{N}\sum_{n=1}^{N}\big(y_n - w^T x_n\big)^2 + \frac{\lambda}{N}\sum_{i=0}^{d}\beta_i\,w_i^2$$
with positive constants β_i. We will call the problem “scaled regularization.”
Now, consider linear regression with virtual examples. That is, we add K virtual examples (x̃_1, ỹ_1), (x̃_2, ỹ_2), ..., (x̃_K, ỹ_K) to the training data set, and solve
$$\min_{w \in \mathbb{R}^{d+1}}\ \frac{1}{N+K}\left(\sum_{n=1}^{N}\big(y_n - w^T x_n\big)^2 + \sum_{k=1}^{K}\big(\tilde{y}_k - w^T \tilde{x}_k\big)^2\right).$$
We will show that using some “special” virtual examples, which were claimed to be a possible way to combat overfitting in Lecture 13, is related to regularization, another possible way to combat overfitting discussed in Lecture 14.
Let X̃ = [x̃_1 x̃_2 ... x̃_K]^T, ỹ = [ỹ_1, ỹ_2, ..., ỹ_K]^T, and B be a diagonal matrix that contains β_0, β_1, β_2, ..., β_d on its diagonal. Set K = d + 1. For what X̃ and ỹ will the optimal solution of this linear regression be the same as the optimal solution of the scaled regularization problem above? Choose the correct answer; explain your answer.
[a] X̃ = λ I_K, ỹ = 0
[b] X̃ = √λ · √B, ỹ = 0
[c] X̃ = √λ · B, ỹ = 0
[d] X̃ = √B, ỹ = λ·1
[e] X̃ = B, ỹ = λ·1
(Note: Both Problem 3 and this problem show that data hinting is closely related to regularization.)
Leave-one-out
10. (Lecture 15) Consider a binary classification algorithm Amajority, which returns a constant classifier that always predicts the majority class (i.e., the class with more instances in the data set that it sees). As you can imagine, the returned classifier is the best-Ein one among all constant classifiers. For a binary classification data set with N positive examples and N negative examples, what is Eloocv(Amajority)? Choose the correct answer; explain your answer.
[a] 0
[b] 1/N
[c] 1/2
[d] (N − 1)/N
[e] 1
11. (Lecture 15) Consider the decision stump model and the data generation process mentioned in Problem 16 of Homework 2, and use the generation process to generate a data set of N examples (instead of 2). If the data set contains at least two positive examples and at least two negative examples, which of the following is the tightest upper bound on the leave-one-out error of the decision stump model? Choose the correct answer; explain your answer.
[a] 0
[b] 1/N
[c] 2/N
[d] 1/2
[e] 1
12. (Lecture 15) You are given three data points: (x1,y1) = (3,0),(x2,y2) = (ρ,2),(x3,y3) = (−3,0) with ρ ≥ 0, and a choice between two models: constant (all hypotheses are of the form h(x) = w0) and linear (all hypotheses are of the form h(x) = w0 + w1x). For which value of ρ would the two models be tied using leave-one-out cross-validation with the squared error measure? Choose the correct answer; explain your answer.
13. (Lecture 15) Consider a probability distribution P(x,y) that can be used to generate examples (x,y), and suppose we generate K i.i.d. examples from the distribution as validation examples, and store them in Dval. For any fixed hypothesis h, we can show that
$$\mathbb{E}_{\mathcal{D}_{\text{val}}}\big[\,\cdots\,\big]$$
Which of the following completes the statement above? Choose the correct answer; explain your answer.
[a] K
[b] 1
Learning Principles
14. (Lecture 16) In Lecture 16, we talked about the probability of fitting the data perfectly when the labels are random. For instance, page 6 of Lecture 16 shows that the probability of fitting the data perfectly with decision stumps is 2N/2^N. Consider 4 vertices of a rectangle in R² as input vectors x_1, x_2, x_3, x_4, and a 2D perceptron model that minimizes E_in(w) to the lowest possible value. One way to measure the power of the model is to consider four random labels y_1, y_2, y_3, y_4, each in {−1, +1} and generated by i.i.d. fair coin flips, and then compute
$$\mathbb{E}_{y_1, y_2, y_3, y_4}\Big[\min_{w}\,E_{\text{in}}(w)\Big].$$
For a perfect fitting, min_w E_in(w) will be 0; for a less perfect fitting (when the data is not linearly separable), min_w E_in(w) will be some non-zero value. The expectation above averages over all 16 possible combinations of y_1, y_2, y_3, y_4. What is the value of the expectation? Choose the correct answer; explain your answer.
[a] 0/64
[b] 1/64
[c] 2/64
[d] 4/64
[e] 8/64
(Note: It can be shown that 1 minus twice the expected value above is the same as the so-called empirical Rademacher complexity of 2D perceptrons. Rademacher complexity, similar to the VC dimension, is another tool to measure the complexity of a hypothesis set. If a hypothesis set shatters some data points, zero Ein can always be achieved and thus Rademacher complexity is 1; if a hypothesis set cannot shatter some data points, Rademacher complexity provides a soft measure of how “perfect” the hypothesis set is.)
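For readers unfamiliar with the term, the standard textbook definition of the empirical Rademacher complexity of a hypothesis set H on fixed inputs x_1, ..., x_N is
$$\hat{\mathcal{R}}(\mathcal{H}) = \mathbb{E}_{y_1, \ldots, y_N}\left[\,\sup_{h \in \mathcal{H}}\ \frac{1}{N}\sum_{n=1}^{N} y_n\,h(x_n)\right],$$
where the y_n are i.i.d. fair ±1 coin flips, exactly as in the problem above. For binary classifiers under the 0/1 error, (1/N) Σ_n y_n h(x_n) = 1 − 2 E_in(h), which is how the note connects this complexity to the expected minimum E_in.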
15. (Lecture 16) Consider a binary classifier g such that
.
When deploying the classifier to a test distribution of P(y = +1) = P(y = −1) = 1/2, we get
. Now, if we deploy the classifier to another test distribution P(y = +1) = p
instead of 1/2, the E_out(g) under this test distribution will then change to a different value. Note that under this test distribution, a constant classifier g_c that always predicts +1 will suffer from E_out(g_c) = (1 − p), as it errs on all the negative examples. At what p ∈ [0, 1] will our binary classifier g be as good (or as bad) as the constant classifier g_c in terms of E_out? Choose the correct answer; explain your answer.
Experiments with Regularized Logistic Regression
Consider L2-regularized logistic regression with second-order polynomial transformation.
$$w_\lambda = \operatorname*{argmin}_{w}\ \frac{1}{N}\sum_{n=1}^{N}\ln\Big(1 + \exp\big(-y_n\,w^T \Phi_2(x_n)\big)\Big) + \frac{\lambda}{N}\,\|w\|^2.$$
Here Φ_2 is the second-order polynomial transformation introduced on page 2 of Lecture 12 (with Q = 2), defined as
$$\Phi_2(x) = \big(1,\ x_1,\ x_2,\ \ldots,\ x_d,\ x_1^2,\ x_1 x_2,\ \ldots,\ x_1 x_d,\ x_2^2,\ x_2 x_3,\ \ldots,\ x_d^2\big).$$
Given that d = 6 in the following data sets, your Φ2(x) should be of 28 dimensions (including the constant dimension).
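A minimal sketch of this transform in Python is given below; the helper name and the ordering of the cross terms are our own choices (any fixed ordering spans the same 28-dimensional space for d = 6).

```python
import itertools

def phi2(x):
    """Second-order polynomial transform of a raw feature vector.

    x: the d raw features (x_1, ..., x_d), without the constant term.
    Returns (1, x_1, ..., x_d, x_1^2, x_1 x_2, ..., x_d^2), which has
    1 + d + d * (d + 1) / 2 components -- 28 when d = 6.
    """
    x = list(x)
    pairs = itertools.combinations_with_replacement(range(len(x)), 2)
    return [1.0] + x + [x[i] * x[j] for i, j in pairs]

assert len(phi2([0.0] * 6)) == 28
```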
Next, we will take the following file as our training data set D:
http://www.csie.ntu.edu.tw/~htlin/course/ml20fall/hw4/hw4_train.dat
and the following file as our test data set for evaluating Eout:
http://www.csie.ntu.edu.tw/~htlin/course/ml20fall/hw4/hw4_test.dat
We call the algorithm for solving the problem above A_λ. The problem guides you to use LIBLINEAR (https://www.csie.ntu.edu.tw/~cjlin/liblinear/), a machine learning package developed in our university, to solve this problem. In addition to using the default options, what you need to do when running LIBLINEAR is
• set option -s 0, which corresponds to solving regularized logistic regression
• set option -c C, with a parameter value of C calculated from the λ that you want to use; read the README of the software package to figure out how C and your λ should relate to each other
• set option -e 0.000001, which corresponds to getting a solution that is really really close to the optimal solution
LIBLINEAR can be called from the command line or from major programming languages like Python. If you run LIBLINEAR in the command line, please include screenshots of the commands/results; if you run LIBLINEAR from any programming language, please include screenshots of your code.
We will consider the data set as a binary classification problem and take the “regression for classification” approach with regularized logistic regression (see Page 6 of Lecture 10). So please evaluate all errors below with the 0/1 error.
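For reference, one possible way to run these experiments from Python is sketched below, reusing the phi2 helper from the sketch above. This is only a sketch under our own assumptions: the import path matches the pip-installed liblinear package, each line of the data files is assumed to hold the six raw features followed by the ±1 label, and the value of C is left as a placeholder that you should derive from your λ via the LIBLINEAR README.

```python
import numpy as np
# Depending on your installation, the import is either
#   from liblinearutil import train, predict            (classic source tree)
# or
#   from liblinear.liblinearutil import train, predict  (pip install liblinear-official)
from liblinear.liblinearutil import train, predict

def load_transformed(path):
    """Read a whitespace-separated data file and apply the phi2 transform above.

    Assumption: each line holds the six raw input features followed by the label.
    """
    data = np.loadtxt(path)
    X = [phi2(row[:-1]) for row in data]   # 28-dimensional transformed inputs
    y = data[:, -1].tolist()
    return X, y

X_train, y_train = load_transformed('hw4_train.dat')
X_test, y_test = load_transformed('hw4_test.dat')

C = 1.0  # placeholder: compute C from the lambda you want (see the LIBLINEAR README)
model = train(y_train, X_train, f'-s 0 -c {C} -e 0.000001 -q')

# predict reports the accuracy in percent; the 0/1 error is 1 - accuracy / 100.
_, test_acc, _ = predict(y_test, X_test, model)
print('Eout (0/1) =', 1.0 - test_acc[0] / 100.0)
```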
16. Select the best λ∗ in a cheating manner as
$$\lambda^{*} = \operatorname*{argmin}_{\log_{10}\lambda\,\in\,\{-4,\,-2,\,0,\,2,\,4\}} E_{\text{out}}(w_\lambda).$$
Break the tie, if any, by selecting the largest λ. What is log10(λ∗)? Choose the closest answer; provide your command/code.
[a] −4
[b] −2
[c] 0
[d] 2
[e] 4
17. Select the best λ∗ as
$$\lambda^{*} = \operatorname*{argmin}_{\log_{10}\lambda\,\in\,\{-4,\,-2,\,0,\,2,\,4\}} E_{\text{in}}(w_\lambda).$$
Break the tie, if any, by selecting the largest λ. What is log10(λ∗)? Choose the closest answer; provide your command/code.
[a] −4
[b] −2
[c] 0
[d] 2
[e] 4
18. Now split the given training examples in D into two sets: the first 120 examples as D_train and the remaining 80 examples as D_val. (Ideally, you should do the 120/80 split randomly. Because the given examples are already randomly permuted, however, we use a fixed split for the purpose of this problem.) Run A_λ on only D_train to get w_λ^- (the weight vector within the g^- returned), and validate w_λ^- with D_val to get E_val(w_λ^-). Select the best λ* as
$$\lambda^{*} = \operatorname*{argmin}_{\log_{10}\lambda\,\in\,\{-4,\,-2,\,0,\,2,\,4\}} E_{\text{val}}(w_\lambda^{-}).$$
Break the tie, if any, by selecting the largest λ. Then, estimate E_out(w_{λ*}^-) with the test set. What is the value of E_out(w_{λ*}^-)? Choose the closest answer; provide your command/code.
[a] 0.10
[b] 0.11
[c] 0.12
[d] 0.13
[e] 0.14
19. For the λ∗ selected in the previous problem, compute wλ∗ by running Aλ∗ with the full training set D. Then, estimate Eout(wλ∗) with the test set. What is the value of Eout(wλ∗)? Choose the closest answer; provide your command/code.
[a] 0.10
[b] 0.11
[c] 0.12
[d] 0.13
[e] 0.14
20. Now split the given training examples in D into five folds, the first 40 being fold 1, the next 40 being fold 2, and so on. Again, we take a fixed split because the given examples are already randomly permuted. Select the best λ* as
$$\lambda^{*} = \operatorname*{argmin}_{\log_{10}\lambda\,\in\,\{-4,\,-2,\,0,\,2,\,4\}} E_{\text{cv}}(A_\lambda).$$
Break the tie, if any, by selecting the largest λ. What is the value of E_cv(A_{λ*})? Choose the closest answer; provide your command/code. (A small sketch of this fixed-fold procedure follows the choices.)
[a] 0.10
[b] 0.11
[c] 0.12
[d] 0.13
[e] 0.14
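As mentioned in Problem 20, here is a minimal sketch of the fixed five-fold procedure, again reusing the helpers from the sketches above; the list of C values is a placeholder that should contain one C per candidate λ.

```python
def zero_one_error(model, X, y):
    """0/1 error of a trained LIBLINEAR model on the examples (X, y)."""
    _, acc, _ = predict(y, X, model)
    return 1.0 - acc[0] / 100.0

def e_cv(X, y, options, n_folds=5):
    """Fixed-split cross-validation: fold k holds examples [40k, 40k + 40)."""
    fold_size = len(X) // n_folds
    errors = []
    for k in range(n_folds):
        lo, hi = k * fold_size, (k + 1) * fold_size
        # Train on everything outside the fold, validate on the fold itself.
        model = train(y[:lo] + y[hi:], X[:lo] + X[hi:], options)
        errors.append(zero_one_error(model, X[lo:hi], y[lo:hi]))
    return sum(errors) / n_folds

for C in [1.0]:  # placeholder: one C per lambda with log10(lambda) in {-4, -2, 0, 2, 4}
    print(C, e_cv(X_train, y_train, f'-s 0 -c {C} -e 0.000001 -q'))
```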