NTU CSIE Homework 2 Solved

Perceptrons
1.    Which of the following sets of x ∈ R3 can be shattered by the 3D perceptron hypothesis set? The set contains all hyperplanes of the form h(x) = sign(w0 x0 + w1 x1 + w2 x2 + w3 x3), with our usual notation of x0 = 1.
Choose the correct answer; explain your choice.

[a]                    {(2,3,4),(4,3,2),(3,3,3)}

[b]                   {(1,1,1),(2,3,4),(4,3,2),(4,2,3)}

[c]                    {(1,1,1),(2,3,4),(4,3,2),(2,2,2)}

[d]                   {(1,1,1),(2,3,4),(4,3,2),(4,2,3),(3,2,4)}

[e]                    none of the other choices (none of them can be shattered)

2.    What is the growth function of origin-passing perceptrons in 2D? Those perceptrons are all perceptrons with w0 = 0. Choose the correct answer; explain your choice.

Hint: Put your input vectors on the unit circle, and perhaps think about a polar coordinate system.

[a]                    2N + 2

[b]   2N + 1

[c]   2N

[d]   2N − 1

[e]   2N − 2

Donut Hypothesis Set
3.    The “donut” hypothesis set in Rd contains hypotheses h_{a,b} parameterized by two positive numbers a and b, where

h_{a,b}(x) = +1 if a^2 ≤ x1^2 + x2^2 + … + xd^2 ≤ b^2, and −1 otherwise.

What is the growth function of the hypothesis set? Choose the correct answer; explain your choice.
[e] none of the other choices

4.    Following the previous problem, what is the VC dimension of the donut hypothesis set? Choose the correct answer; explain your choice.

[a]   d
[b]   6
[c]   3
[d]   2

[e]   none of the other choices

More on VC Dimension
5.    Which of the following hypothesis sets has a VC dimension different from the others? Choose the correct answer; explain your choice.
[a]   Unions of two positive intervals over x ∈ R, which return +1 if x is within at least one of the intervals.
[b]   Axis-aligned rectangle classifiers for x ∈ R2, which return +1 if x is inside a rectangle whose edges are parallel to the axes of R2.
[c]   Positively-biased perceptrons over x ∈ R4, which are perceptrons with w0 > 0.
[d]   Polynomial hypotheses of degree 3 for x ∈ R, which are of the form h(x) = sign(w0 + w1 x + w2 x^2 + w3 x^3).
[e]   none of the other choices

6.    For a finite hypothesis set H with 1126 binary classifiers, what is the largest possible value of dvc(H)? Choose the correct answer; explain your choice.
[a]   1126
[b]   112
[c]    11
[d]   10
[e]   1

Deviation from Optimal Hypothesis
7.    Recall that the multiple-bin Hoeffding bound quantifies the BAD probability from any hypothesis h in a finite hypothesis set H of size M. That is,

P[∃h ∈ H s.t. |Ein(h) − Eout(h)| > ε] ≤ 2M exp(−2ε^2 N).
Define the best-Ein hypothesis g = argmin_{h∈H} Ein(h), and the best-Eout hypothesis (which is optimal but can only be obtained by a “cheating” algorithm) g∗ = argmin_{h∈H} Eout(h).
Using the multiple-bin Hoeffding bound above, with probability more than 1 − δ, which of the following is an upper bound of Eout(g)−Eout(g∗)? Choose the correct answer; explain your choice.

VC Bound
8.    When using the positive ray model taught in class, given ε = 0.1 and δ = 0.1, among the five choices, what is the smallest N such that the VC bound on the BAD probability,

P[∃h ∈ H s.t. |Ein(h) − Eout(h)| > ε] ≤ 4 mH(2N) exp(−(1/8) ε^2 N),

is ≤ δ? Choose the correct answer; explain your choice. (A numerical check is sketched after the choices.)
[a]   10000
[b]   11000
[c]    12000
[d]   13000
[e]   14000
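The following is a minimal numerical sketch for checking the bound, assuming ε = 0.1 and δ = 0.1 as in the statement above, and using the growth function mH(N) = N + 1 of the positive ray model; it simply evaluates the bound at the five candidate values of N.

```python
import math

# Sketch: evaluate the VC bound 4 * m_H(2N) * exp(-eps^2 * N / 8)
# for the positive ray model, where m_H(N) = N + 1.
def vc_bound(N, eps=0.1):
    return 4 * (2 * N + 1) * math.exp(-(eps ** 2) * N / 8)

for N in [10000, 11000, 12000, 13000, 14000]:
    print(N, vc_bound(N))   # compare each value against delta = 0.1
```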

Hessian and Newton Method
9.    Let E(w): Rd → R be a function. Denote the gradient bE(w) and the Hessian AE(w) by

bE(w)[i] = ∂E(w)/∂wi    and    AE(w)[i][j] = ∂^2 E(w)/(∂wi ∂wj).

Then, the second-order Taylor expansion of E(w) around u is

E(w) ≈ E(u) + (w − u)T bE(u) + (1/2)(w − u)T AE(u)(w − u).

Suppose AE(u) is positive definite. What is the optimal direction v such that w ← u + v minimizes the right-hand side of the Taylor expansion above? Choose the correct answer; explain your choice. (A numerical illustration follows the choices.)
Note that iterative optimization with v is generally called Newton’s method.
[a]   +(AE(u))−1bE(u)
[b]   −(AE(u))−1bE(u)
[c]    +(AE(u))+1bE(u)
[d]  −(AE(u))+1bE(u)
[e]   none of the other choices
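As a numerical illustration of the definitions above, here is a minimal sketch, assuming NumPy; the toy function E, the expansion point u, and the finite-difference derivatives are illustrative choices only and are not part of the problem. It builds the quadratic model and evaluates its right-hand side for the two candidate directions ±(AE(u))−1 bE(u).

```python
import numpy as np

# Toy function with a positive definite Hessian (illustrative only).
def E(w):
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 0.5) ** 2 + 0.5 * w[0] * w[1]

def b_E(w, h=1e-5):
    # numerical gradient of E at w
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = h
        g[i] = (E(w + e) - E(w - e)) / (2 * h)
    return g

def A_E(w, h=1e-4):
    # numerical Hessian of E at w (symmetrized)
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d); e[i] = h
        H[:, i] = (b_E(w + e) - b_E(w - e)) / (2 * h)
    return 0.5 * (H + H.T)

u = np.array([0.0, 0.0])
b, A = b_E(u), A_E(u)

def taylor_rhs(v):
    # E(u) + b^T v + 0.5 v^T A v
    return E(u) + b @ v + 0.5 * v @ A @ v

for name, v in [("+A^-1 b", np.linalg.solve(A, b)),
                ("-A^-1 b", -np.linalg.solve(A, b))]:
    print(name, taylor_rhs(v))
```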

10.    Following the previous problem, consider minimizing Ein(w) in the logistic regression problem with Newton's method. Consider a data set {(x1, y1), (x2, y2), ..., (xN, yN)} with the cross-entropy error function for Ein:

Ein(w) = (1/N) Σ_{n=1}^{N} ln(1 + exp(−yn wT xn)).

For any given wt, let θ(s) = 1/(1 + exp(−s)) denote the logistic function. What is the Hessian AE(wt) with E = Ein? Choose the correct answer; explain your choice.
[e] none of the other choices
Linear Regression

 

11.   In the lecture, we learned that the linear regression weights w must satisfy the normal equation XTXw = XTy. If XTX is not invertible, we cannot compute wlin = (XTX)−1XTy. Then, one possible solution is wlin = X†y,

where X† is called the Moore-Penrose pseudo-inverse. Let X = UΣVT be the singular value decomposition of the N by d + 1 matrix X, with U and V being unitary matrices. The Moore-Penrose pseudo-inverse X† = VΣ†UT is a d + 1 by N matrix, where Σ†[i][n] = 1/Σ[n][i] when Σ[n][i] is nonzero, and 0 otherwise. Which of the following statements related to X† is incorrect? Choose the incorrect statement; explain your choice. (A small numerical sketch follows the choices.)
[a]   (XTX)−1XT = X† when XTX is invertible.
[b]   For any k ∈ Z+, (XX†)k = XX†.
[c]    XX† = IN, the N by N identity matrix.
[d]   trace(XX†) = rank(X).
[e]   none of the other choices
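Below is a minimal numerical sketch of the SVD construction described above, assuming NumPy; the random matrix X and its dimensions are illustrative only, and the code assumes all singular values are nonzero.

```python
import numpy as np

# Build X-dagger = V Sigma-dagger U^T from the SVD and compare with NumPy's pinv.
rng = np.random.default_rng(0)
N, d = 8, 2
X = rng.standard_normal((N, d + 1))

U, s, Vt = np.linalg.svd(X, full_matrices=True)   # X = U diag(s) Vt
Sigma_dag = np.zeros((d + 1, N))
Sigma_dag[:len(s), :len(s)] = np.diag(1.0 / s)    # invert the nonzero singular values

X_dag = Vt.T @ Sigma_dag @ U.T
print(np.allclose(X_dag, np.linalg.pinv(X)))      # sanity check against np.linalg.pinv
print(np.trace(X @ X_dag), np.linalg.matrix_rank(X))
```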

12.   In the lecture, we learned that logistic regression can be derived by maximizing the likelihood function in which each label yn is generated from P(+1|xn), “pretended by” 1/(1 + exp(−wTxn)). Now, consider a case where each real-valued label yn is generated from p(y|xn), where p is used instead of P to denote a probability density function (instead of a probability mass function; you can ignore the subtle mathematical difference for now). We will consider p(y|xn) to be pretended by a normal distribution with mean wTxn and variance a2. Assuming that all inversions in the equations below exist, what is the optimal w∗ that maximizes the likelihood function for this case? Choose the correct answer; explain your choice.
[e] none of the other choices

Experiments with Linear Models
Next, we will play with linear regression, logistic regression, and their use for binary classification. We will also learn about their different robustness to outliers. We will generate artificial 2D data for training and testing in the next problems. Each example (x, y) is assumed to be generated from the following process:

•   Flip a fair coin to get either y = +1 or y = −1.

•   If y = +1, generate x = (1,x1,x2) where (x1,x2) comes from a normal distribution of mean [2,3]
and covariance  .

•   If y = −1, generate x = (1,x1,x2) where (x1,x2) comes from a normal distribution of mean [0,4]
and covariance  .

Please generate N = 200 examples from the process as your training data set D. Then, generate 5000 more examples from the process as your test data set (for evaluating Eout).

Hint: Be sure to check whether your normal distribution function needs you to provide the variance, which would be like 0.6 for the yn = +1 cases, or the standard deviation, which would be like √0.6.
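A minimal data-generation sketch is given below, assuming NumPy. The covariance matrices are placeholders (0.6·I for both classes, in line with the 0.6 mentioned in the hint for the yn = +1 case); substitute the actual covariances from the problem statement. The helper name generate_data is only for the sketches that follow.

```python
import numpy as np

def generate_data(n, rng):
    # Flip a fair coin for each label, then draw (x1, x2) from the class-conditional normal.
    y = rng.choice([-1.0, 1.0], size=n)
    mean_pos, mean_neg = np.array([2.0, 3.0]), np.array([0.0, 4.0])
    cov = 0.6 * np.eye(2)                                  # placeholder covariance
    x12 = np.where(y[:, None] > 0,
                   rng.multivariate_normal(mean_pos, cov, size=n),
                   rng.multivariate_normal(mean_neg, cov, size=n))
    X = np.hstack([np.ones((n, 1)), x12])                  # prepend x0 = 1
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = generate_data(200, rng)
X_test, y_test = generate_data(5000, rng)
```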

13.   (*) Implement the linear regression algorithm taught in the lecture. Run the algorithm 100 times, each time with a different random seed for generating the two data sets above. What is the average Ein^sqr(wlin), where Ein^sqr denotes the averaged squared error over the N training examples? Choose the closest answer; provide your code. (A minimal code sketch follows the choices.)
[a]   0.04
[b]   0.16
[c]    0.24
[d]   0.28
[e]   0.40
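A minimal sketch for this problem, reusing generate_data from the sketch above (the numerical result therefore depends on the placeholder covariances used there):

```python
import numpy as np

def squared_error(w, X, y):
    return np.mean((X @ w - y) ** 2)

ein_sqr = []
for seed in range(100):
    rng = np.random.default_rng(seed)
    X_train, y_train = generate_data(200, rng)
    w_lin = np.linalg.pinv(X_train) @ y_train    # w_lin = X^dagger y
    ein_sqr.append(squared_error(w_lin, X_train, y_train))
print(np.mean(ein_sqr))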

14.   (*) Following the previous problem, what is the average |Ein^{0/1}(wlin) − Eout^{0/1}(wlin)|, where the superscript 0/1 denotes the 0/1 error (i.e., using wlin for binary classification), Ein^{0/1} denotes the averaged 0/1 error over the N training examples, and Eout^{0/1} is estimated with the averaged 0/1 error on the test data set above? Choose the closest answer; provide your code. (A short sketch follows the choices.)
[a]   0.091
[b]   0.065
[c]    0.039
[d]   0.013
[e]   0.001
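A short sketch of the 0/1-error gap, reusing generate_data from above; zero_one_error is an illustrative helper name.

```python
import numpy as np

def zero_one_error(w, X, y):
    # 0/1 error when using sign(X w) as a binary classifier
    return np.mean(np.sign(X @ w) != y)

gaps = []
for seed in range(100):
    rng = np.random.default_rng(seed)
    X_train, y_train = generate_data(200, rng)
    X_test, y_test = generate_data(5000, rng)
    w_lin = np.linalg.pinv(X_train) @ y_train
    gaps.append(abs(zero_one_error(w_lin, X_train, y_train)
                    - zero_one_error(w_lin, X_test, y_test)))
print(np.mean(gaps))
```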

15.   (*) Consider two algorithms. The first one, A, is the linear regression algorithm above. The second one, B, is logistic regression, trained with gradient descent with η = 0.1 for T = 500 iterations, starting from w0 = 0. Run the two algorithms on the same D, and record (Eout^{0/1}(wA), Eout^{0/1}(wB)), where wA and wB are the weights returned by A and B, respectively. Repeat the process 100 times, each time with a different random seed for generating the two data sets above. What is the average (Eout^{0/1}(wA), Eout^{0/1}(wB))? Choose the closest answer; provide your code. (A sketch of algorithm B follows the choices.)
[a]   (0.018, 0.018)
[b]   (0.058, 0.058)
[c]    (0.058, 0.093)
[d]   (0.138, 0.108)
[e]   (0.268, 0.208)
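A sketch of algorithm B (logistic regression with plain gradient descent, η = 0.1, T = 500, w0 = 0) and of the comparison, reusing generate_data and zero_one_error from the sketches above; the gradient is the standard cross-entropy gradient.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_gd(X, y, eta=0.1, T=500):
    w = np.zeros(X.shape[1])
    for _ in range(T):
        # gradient of Ein(w) = (1/N) sum ln(1 + exp(-y_n w^T x_n))
        grad = np.mean(sigmoid(-y * (X @ w))[:, None] * (-y[:, None] * X), axis=0)
        w -= eta * grad
    return w

results = []
for seed in range(100):
    rng = np.random.default_rng(seed)
    X_train, y_train = generate_data(200, rng)
    X_test, y_test = generate_data(5000, rng)
    w_A = np.linalg.pinv(X_train) @ y_train          # algorithm A: linear regression
    w_B = logistic_gd(X_train, y_train)              # algorithm B: logistic regression
    results.append((zero_one_error(w_A, X_test, y_test),
                    zero_one_error(w_B, X_test, y_test)))
print(np.mean(results, axis=0))
```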

16.   (*) Following the previous problem, in addition to the 200 examples in D, add 20 outlier examples generated from the following process to your training data (but not to your test data). All outlier examples are labeled y = +1 and have x = (1, x1, x2), where (x1, x2) comes from a normal distribution with mean [6, 0] and the specified covariance. Name the new training data set D′. Run the two algorithms on the same D′, and record (Eout^{0/1}(wA), Eout^{0/1}(wB)) as in the previous problem. Repeat the process 100 times, each with a different random seed for generating the two data sets above. What is the average (Eout^{0/1}(wA), Eout^{0/1}(wB))? Choose the closest answer; provide your code. (A sketch of the outlier generation follows the choices.)
[a]   (0.070, 0.018)
[b]   (0.240, 0.048)
[c]    (0.090, 0.058)
[d]   (0.090, 0.078)
[e]   (0.270, 0.108)
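A short sketch of generating and appending the 20 outliers; the covariance is again a placeholder (0.6·I), and add_outliers is an illustrative helper name.

```python
import numpy as np

def add_outliers(X, y, rng, n_out=20):
    # 20 outliers, all labeled +1, drawn around mean [6, 0] (placeholder covariance)
    x12 = rng.multivariate_normal([6.0, 0.0], 0.6 * np.eye(2), size=n_out)
    X_out = np.hstack([np.ones((n_out, 1)), x12])
    return np.vstack([X, X_out]), np.concatenate([y, np.ones(n_out)])

# Usage: X_train2, y_train2 = add_outliers(X_train, y_train, rng)
# then rerun algorithms A and B on (X_train2, y_train2) as in Problem 15.
```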
