NTU CSIE Machine Learning - Homework 4

1. Consider the augmented error
Eaug(w) = Ein(w) + (λ/N) wᵀw
with some λ > 0. Optimize the error by gradient descent with η as the learning rate. That is,
w(t + 1) ← w(t) − η∇Eaug(w(t)).
The update rule above is equivalent to multiplying w(t) by a shrinkage factor s before updating with the negative gradient of Ein:
w(t + 1) ← s · w(t) − η∇Ein(w(t)).
What is s? Choose the correct answer; explain your answer.
[e]   none of the other choices
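A sketch of the underlying computation (assuming the λ/N convention written above; not the official solution): since ∇Eaug(w) = ∇Ein(w) + (2λ/N) w, the gradient-descent step becomes
w(t + 1) = w(t) − η∇Ein(w(t)) − η(2λ/N) w(t) = (1 − 2ηλ/N) · w(t) − η∇Ein(w(t)),
so the shrinkage factor would be s = 1 − 2ηλ/N.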

2. Consider N “labels” y1, y2, ..., yN with each yn ∈ R. Then, solve the following one-variable regularized problem:
min over w ∈ R of  (1/N) Σ_{n=1}^{N} (w − yn)² + (λ/N) w².
If the optimal solution to the problem above is w∗, it can be shown that w∗ is also the optimal solution of
min over w ∈ R of  (1/N) Σ_{n=1}^{N} (w − yn)²   subject to   w² ≤ C
with C = (w∗)². This allows us to express the relationship between C in the constrained optimization problem and λ in the augmented optimization problem for any λ > 0. What is the relationship? Choose the correct answer; explain your answer.
[e]   none of the other choices
(Note: All the choices hint that a smaller λ corresponds to a bigger C.)
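One way to see the relationship (a sketch under the λ/N convention above; not the official solution): setting the derivative of the regularized objective to zero gives (2/N) Σ_{n=1}^{N} (w − yn) + (2λ/N) w = 0, so
w∗ = (Σ_{n=1}^{N} yn) / (N + λ)   and   C = (w∗)² = (Σ_{n=1}^{N} yn)² / (N + λ)²,
which indeed becomes bigger as λ becomes smaller.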

3. Scaling can affect regularization. Consider a data set {(xn, yn)} for n = 1, ..., N. Define Φ(x) = Vx, where V is a diagonal matrix with the i-th diagonal component storing a positive value that scales the i-th feature. Now, conduct L2-regularized linear regression on the transformed data {(Φ(xn), yn)}:

min over w̃ ∈ R^{d+1} of  (1/N) Σ_{n=1}^{N} (w̃ᵀΦ(xn) − yn)² + (λ/N) w̃ᵀw̃

The problem is equivalent to the following regularized linear regression problem on the original data with a different regularizer:

min over w ∈ R^{d+1} of  (1/N) Σ_{n=1}^{N} (wᵀxn − yn)² + (λ/N) wᵀUw

What is U? Choose the correct answer; explain your answer.

[a]   V

[b]   V2

[c]    V−1

[d]   (V−1)2

[e]   none of the other choices
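A sketch of one possible argument (not the official solution): since w̃ᵀΦ(xn) = w̃ᵀVxn = (Vw̃)ᵀxn, the substitution w = Vw̃ (equivalently w̃ = V⁻¹w, which is valid because V is diagonal with positive entries) leaves the squared-error term unchanged while turning the regularizer into
w̃ᵀw̃ = (V⁻¹w)ᵀ(V⁻¹w) = wᵀ(V⁻¹)² w.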

4. Noise might seem to be an absolute evil at first glance. Can we add noise to combat overfitting, much like how we get vaccinated? Consider a noisy transform Φ(x) = x + ϵ that adds a noise vector ϵ = (ϵ0, ..., ϵd) to x. Assume that the noise vector is generated i.i.d. from a multivariate normal distribution N(0, σ²I). Consider the following optimization problem that minimizes the expected Ein of the transformed data:

min over w ∈ R^{d+1} of  E_ϵ [ (1/N) Σ_{n=1}^{N} (wᵀ(xn + ϵn) − yn)² ],

where each ϵn denotes an independent copy of the noise vector added to xn. The problem is actually equivalent to L2-regularized linear regression with some λ:

min over w ∈ R^{d+1} of  (1/N) Σ_{n=1}^{N} (wᵀxn − yn)² + (λ/N) wᵀw
What is λ? Choose the correct answer; explain your answer.
[a] Nσ²
[e]   none of the other choices
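A sketch of the expectation (under the objective as written above; not the official solution): for each example,
E_ϵ[(wᵀ(xn + ϵn) − yn)²] = (wᵀxn − yn)² + E[(wᵀϵn)²] = (wᵀxn − yn)² + σ² wᵀw,
because the cross term vanishes (E[ϵn] = 0) and E[ϵn ϵnᵀ] = σ²I. Averaging over the N examples therefore adds σ² wᵀw to the usual Ein, which matches the (λ/N) wᵀw regularizer exactly when λ = Nσ².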

5. Additive smoothing (https://en.wikipedia.org/wiki/Additive_smoothing) is a simple yet useful technique for estimating discrete probabilities. Consider the technique for estimating the head probability of a coin. Let y1, y2, ..., yN denote the flip results of a coin, with yn = 1 representing a head and yn = 0 representing a tail. Additive smoothing adds (αk) “virtual flips”, with α of them being heads and the other (k − 1)α being tails. Then, the head probability is estimated by
(α + Σ_{n=1}^{N} yn) / (N + αk).
The estimate can be viewed as the optimal solution of
min over y of  Σ_{n=1}^{N} (y − yn)² + Ω(y),
where Ω(y) is a “regularizer” for this estimation problem. What is Ω(y)? Choose the correct answer; explain your answer.
[a] (y + k)²
[e] none of the other choices
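As a sanity check on the form above (a sketch, not the official solution): if, for instance, Ω(y) = αk (y − 1/k)², then setting the derivative of the objective to zero gives 2 Σ_{n=1}^{N} (y − yn) + 2αk (y − 1/k) = 0, i.e. y = (α + Σ_{n=1}^{N} yn) / (N + αk), which is exactly the smoothed estimate.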

6. When performing L2-regularization on any twice-differentiable Ein(w) for some linear model of interest, the optimization problem can be written as:
min over w ∈ R^{d+1} of  Eaug(w),   where Eaug(w) = Ein(w) + (λ/N) wᵀw.
Suppose w∗ is the minimizer of Ein(w), that is, ∇Ein(w∗) = 0. Taking the second-order Taylor expansion of Ein(w) around w∗, we can approximate Ein(w) by
Ein(w∗) + ½ (w − w∗)ᵀ H (w − w∗),
where H ∈ R^{(d+1)×(d+1)} is some Hessian matrix. Then, Eaug(w) can be approximated by
Ẽaug(w) = Ein(w∗) + ½ (w − w∗)ᵀ H (w − w∗) + (λ/N) wᵀw.
Which of the following is the minimizer of E˜aug(w)? Choose the correct answer; explain your answer.
[c] (HᵀH + λI)⁻¹HᵀHw∗
[e]   none of the other choices.
(Note: The result demonstrates the difference between the minimizers of Ein and Ẽaug.)
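A sketch of the minimization (under the λ/N convention and the quadratic approximation above; not the official solution): setting ∇Ẽaug(w) = H(w − w∗) + (2λ/N) w = 0 gives (H + (2λ/N) I) w = H w∗, i.e.
w = (H + (2λ/N) I)⁻¹ H w∗,
provided the inverse exists.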

Validation
7. Consider a binary classification algorithm Aminority, which returns a constant classifier that always predicts the minority class (i.e., the class with fewer instances in the data set that it sees). As you can imagine, the returned classifier is the worst-Ein one among all constant classifiers. For a binary classification data set with N positive examples and N negative examples, what is Eloocv(Aminority)? Choose the correct answer; explain your answer.
[a] 0
[b] 1/N
[c] 1/2
[d]   (N − 1)/N
[e]   1

8. You are given three data points: (x1,y1) = (2,0),(x2,y2) = (ρ,2),(x3,y3) = (−2,0) with ρ ≥ 0, and a choice between two models: constant (all hypotheses are of the form h(x) = w0) and linear (all hypotheses are of the form h(x) = w0 + w1x). For which value of ρ would the two models be tied using leave-one-out cross-validation with the squared error measure? Choose the closest answer; explain your answer.

[a]   6.5

[b]   7.5

[c]    8.5

[d]   9.5

[e]   10.5
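If a numerical check helps, here is a small Python sketch (an illustration under the setup above, not the official solution): it evaluates the leave-one-out squared error of both models over a grid of ρ values and reports where the two curves meet.

import numpy as np

def loocv_constant(rho):
    pts = [(2.0, 0.0), (rho, 2.0), (-2.0, 0.0)]
    err = 0.0
    for i in range(3):
        rest = [pts[j] for j in range(3) if j != i]
        w0 = np.mean([y for _, y in rest])        # best constant = mean of the two kept labels
        err += (w0 - pts[i][1]) ** 2
    return err / 3

def loocv_linear(rho):
    pts = [(2.0, 0.0), (rho, 2.0), (-2.0, 0.0)]
    err = 0.0
    for i in range(3):
        (xa, ya), (xb, yb) = [pts[j] for j in range(3) if j != i]
        w1 = (yb - ya) / (xb - xa)                # line through the two kept points
        w0 = ya - w1 * xa
        err += (w0 + w1 * pts[i][0] - pts[i][1]) ** 2
    return err / 3

rhos = np.linspace(3.0, 12.0, 9001)               # avoid rho = 2, where the linear fit degenerates
gaps = [abs(loocv_constant(r) - loocv_linear(r)) for r in rhos]
print(rhos[int(np.argmin(gaps))])                 # roughly 8.67, so the closest listed value is 8.5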

9. For N labels y1, y2, ..., yN generated i.i.d. from some distribution with mean 0 and variance σ², partition the first N − K labels to be the training data and the last K labels to be the validation data. If we “estimate” the mean by the “target” of 0, the expected validation error E[(1/K) Σ_{n=N−K+1}^{N} (0 − yn)²], where the expectation is taken over the process of generating the N labels, is simply σ² by the independence of the yn's. Now, assume that we instead estimate the mean by the average of the first N − K examples. That is,
ȳ = (1/(N − K)) Σ_{n=1}^{N−K} yn.
What is the expected validation error E[(1/K) Σ_{n=N−K+1}^{N} (ȳ − yn)²]?

Choose the correct answer; explain your answer.

[a]   σ²

[e] none of the other choices
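A sketch of the computation (not the official solution): the average ȳ is independent of each validation label yn, with E[ȳ] = 0 and Var(ȳ) = σ²/(N − K), so
E[(ȳ − yn)²] = Var(ȳ) + Var(yn) = σ²/(N − K) + σ².
Averaging over the K validation labels does not change this value, so the expected validation error is (1 + 1/(N − K)) σ².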

Learning Principles

10. We talked about the probability of fitting the data perfectly when the labels are random. For instance, page 5 of Lecture 9 shows that the probability of fitting the data perfectly with positive rays is (N + 1)/2^N. Consider 4 different points on the unit circle in R² as input vectors x1, x2, x3 and x4, and a 2D perceptron model that minimizes Ein(w) (0/1 error) to the lowest possible value. One way to measure the power of the model is to consider four random labels y1, y2, y3, y4, each in ±1 and generated by i.i.d. fair coin flips, and then compute
E_{y1,y2,y3,y4} [ min over w of Ein(w) ].
For a perfect fitting, min Ein(w) will be 0; for a less perfect fitting (when the data is not linearly separable), min Ein(w) will be some non-zero value. The expectation above averages over all 16 possible combinations of y1, y2, y3, y4. What is the value of the expectation? Choose the correct answer; explain your answer.

[a]   1/32

[b]   2/32

[c]    3/32

[d] 5/32
[e] 8/32
(Note: It can be shown that 1 minus twice the expected value above is the same as the so-called empirical Rademacher complexity of 2D perceptrons. Rademacher complexity, similar to the VC dimension, is another tool to measure the complexity of a hypothesis set. If a hypothesis set shatters some data points, zero Ein can always be achieved and thus Rademacher complexity is 1; if a hypothesis set cannot shatter some data points, Rademacher complexity provides a soft measure of how “perfect” the hypothesis set is.)
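For a concrete check, here is a small brute-force sketch in Python (one assumption: the four points are placed at angles 0°, 90°, 180°, and 270° on the unit circle; the problem only requires four distinct points, so this is just one possible configuration). It enumerates the dichotomies a 2D perceptron (with a bias term) can realize via a feasibility linear program, then averages the minimum Ein over the 16 labelings.

import itertools
import numpy as np
from scipy.optimize import linprog

X = np.array([[1.0, 1, 0], [1, 0, 1], [1, -1, 0], [1, 0, -1]])   # rows are (x0 = 1, x1, x2)

def separable(y):
    # Feasibility LP: does some w satisfy y_n * (w . x_n) >= 1 for every n?
    res = linprog(np.zeros(3), A_ub=-(y[:, None] * X), b_ub=-np.ones(len(y)),
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

dichotomies = [np.array(d) for d in itertools.product([-1, 1], repeat=4)]
achievable = [d for d in dichotomies if separable(d)]             # dichotomies a perceptron realizes

# min Ein for a labeling = smallest fraction of disagreements with any achievable dichotomy
expectation = np.mean([min(np.mean(d != y) for d in achievable) for y in dichotomies])
print(expectation)   # for this configuration: 2 non-separable labelings * (1/4) / 16 = 1/32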

11. Consider a binary classifier g such that

P(g(x) = −1 | y = +1) = ϵ+
P(g(x) = +1 | y = −1) = ϵ−

When deploying the classifier to a test distribution with P(y = +1) = P(y = −1) = 1/2, we get Eout(g) = ½(ϵ+ + ϵ−). Now, if we deploy the classifier to another test distribution with P(y = +1) = p instead of 1/2, the Eout(g) under this test distribution will change to a different value. Note that under this test distribution, a constant classifier g− that always predicts −1 suffers Eout(g−) = p, as it errs on all the positive examples. At what p, if its value is between [0, 1], will our binary classifier g be as good as (or as bad as) the constant classifier g− in terms of Eout? Choose the correct answer; explain your answer.
[e] none of the other choices
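A sketch of the comparison (not the official solution): under the new test distribution, Eout(g) = p · ϵ+ + (1 − p) · ϵ−. Setting this equal to Eout(g−) = p and solving gives
p = ϵ− / (1 − ϵ+ + ϵ−),
which is the break-even point whenever it falls in [0, 1].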

Experiments with Regularized Logistic Regression

Consider L2-regularized logistic regression with third-order polynomial transformation.

wλ = argmin over w of  (λ/N) wᵀw + (1/N) Σ_{n=1}^{N} ln(1 + exp(−yn wᵀΦ3(xn)))

Here Φ3 is the third-order polynomial transformation introduced in Lecture 6 (with Q = 3).

Next, we will take the following file as our training data set D:

http://www.csie.ntu.edu.tw/~htlin/course/ml21fall/hw4/hw4_train.dat

and the following file as our test data set for evaluating Eout:

http://www.csie.ntu.edu.tw/~htlin/course/ml21fall/hw4/hw4_test.dat

We call the algorithm for solving the problem above Aλ. The problem guides you to use LIBLINEAR (https://www.csie.ntu.edu.tw/~cjlin/liblinear/), a machine learning package developed at our university, to solve this problem. In addition to using the default options, what you need to do when running LIBLINEAR is:

•   set option -s 0, which corresponds to solving regularized logistic regression

•   set option -c C, with a parameter value of C calculated from the λ that you want to use; read README of the software package to figure out how C and your λ should relate to each other

•   set option -e 0.000001, which corresponds to getting a solution that is really really close to the optimal solution

LIBLINEAR can be called from the command line or from major programming languages like python. If you run LIBLINEAR in the command line, please include screenshots of the commands/results; if you run LIBLINEAR from any programming language, please include screenshots of your code.

We will consider the data set as a binary classification problem and take the “regression for classification” approach with regularized logistic regression (see Page 6 of Lecture 10). So please evaluate all errors below with the 0/1 error.
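The following Python sketch is one possible way to run the experiments in Problems 12-16 (an illustration, not the official reference code). It assumes the .dat files have been downloaded locally and contain whitespace-separated columns with the label in the last column, it takes Φ3 to be all monomials of the raw features with degree at most 3 (check this against the Lecture 6 definition), it uses the liblinearutil Python interface shipped with LIBLINEAR, and it maps λ to LIBLINEAR's C as C = 1/(2λ) under the objective written above (verify this against the README as the problem instructs).

import itertools
import numpy as np
from liblinearutil import train, predict        # LIBLINEAR's bundled Python interface

def poly3(X):
    # All monomials of degree <= 3 of the raw features, plus the constant term 1
    # (LIBLINEAR's default bias is -1, so the constant feature plays the role of x0).
    cols = [np.ones(len(X))]
    for deg in (1, 2, 3):
        for idx in itertools.combinations_with_replacement(range(X.shape[1]), deg):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

def load(path):
    data = np.loadtxt(path)
    return poly3(data[:, :-1]), data[:, -1]

def zero_one_error(model, Z, y):
    pred, _, _ = predict(y.tolist(), Z.tolist(), model)
    return float(np.mean(np.array(pred) != y))

Z_train, y_train = load('hw4_train.dat')
Z_test, y_test = load('hw4_test.dat')

for log_lambda in (-4, -2, 0, 2, 4):
    C = 1.0 / (2.0 * 10.0 ** log_lambda)        # assumed lambda -> C mapping; see README
    model = train(y_train.tolist(), Z_train.tolist(), f'-s 0 -c {C} -e 0.000001 -q')
    print(log_lambda,
          zero_one_error(model, Z_train, y_train),   # Ein(w_lambda)
          zero_one_error(model, Z_test, y_test))     # Eout(w_lambda)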

12. Select the best λ∗ in a cheating manner as

λ∗ = argmin over log10 λ ∈ {−4, −2, 0, 2, 4} of  Eout(wλ).

Break the tie, if any, by selecting the largest λ. What is log10(λ∗)? Choose the closest answer; provide your command/code.

[a]   -4

[b]   -2

[c]   0

[d]   2

[e]   4

13. Select the best λ∗ as

λ∗ = argmin over log10 λ ∈ {−4, −2, 0, 2, 4} of  Ein(wλ).

Break the tie, if any, by selecting the largest λ. What is log10(λ∗)? Choose the closest answer; provide your command/code.

[a]   -4

[b]   -2

[c]   0

[d]   2

[e]   4

14. Now split the given training examples in D into two sets: the first 120 examples as Dtrain and the remaining 80 examples as Dval. (Ideally, you should do the 120/80 split randomly. Because the given examples are already randomly permuted, however, we use a fixed split for the purpose of this problem.) Run Aλ on only Dtrain to get wλ− (the weight vector within the g− returned), and validate wλ− with Dval to get Eval(wλ−). Select the best λ∗ as

λ∗ = argmin over log10 λ ∈ {−4, −2, 0, 2, 4} of  Eval(wλ−).
Break the tie, if any, by selecting the largest λ. Then, estimate Eout(wλ−∗) with the test set. What is the value of Eout(wλ−∗)? Choose the closest answer; provide your command/code.

[a]   0.081

[b]   0.078

[c]    0.075

[d]   0.072

[e]   0.069

15.   For the λ∗ selected in the previous problem, compute wλ∗ by running Aλ∗ with the full training set D. Then, estimate Eout(wλ∗) with the test set. What is the value of Eout(wλ∗)? Choose the closest answer; provide your command/code.

[a]   0.081

[b]   0.078

[c]    0.075

[d]   0.072

[e]   0.069

16. Now split the given training examples in D into five folds: the first 40 examples being fold 1, the next 40 being fold 2, and so on. Again, we take a fixed split because the given examples are already randomly permuted. Select the best λ∗ as

λ∗ = argmin over log10 λ ∈ {−4, −2, 0, 2, 4} of  Ecv(Aλ).

Break the tie, if any, by selecting the largest λ. What is the value of Ecv(Aλ∗)? Choose the closest answer; provide your command/code.

[a]   0.07

[b]   0.08

[c]    0.09

[d]   0.10

[e]   0.11
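Continuing the sketch given before Problem 12, under the same assumptions, the fixed five-fold split of this problem (40 consecutive examples per fold) could be evaluated as follows.

def e_cv(Z, y, log_lambda, n_folds=5):
    C = 1.0 / (2.0 * 10.0 ** log_lambda)        # same assumed lambda -> C mapping as above
    fold_size = len(y) // n_folds
    errors = []
    for k in range(n_folds):
        val_idx = np.arange(k * fold_size, (k + 1) * fold_size)
        train_idx = np.setdiff1d(np.arange(len(y)), val_idx)
        model = train(y[train_idx].tolist(), Z[train_idx].tolist(),
                      f'-s 0 -c {C} -e 0.000001 -q')
        errors.append(zero_one_error(model, Z[val_idx], y[val_idx]))
    return float(np.mean(errors))

for log_lambda in (-4, -2, 0, 2, 4):
    print(log_lambda, e_cv(Z_train, y_train, log_lambda))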
