ML-Homework 2 Solved

Perceptrons
Which of the following sets of x ∈ R3 can be shattered by the 3D perceptron hypothesis set? The set contains all hyperplanes of the form h_w(x) = sign(wᵀx) = sign(w0x0 + w1x1 + w2x2 + w3x3), with our usual notation of x0 = 1. (A brute-force shattering check is sketched after the choices.)

Choose the correct answer; explain your answer.

{(7,8,9),(17,18,19),(27,28,29)}
{(1,1,1),(7,8,9),(15,16,17),(21,23,25)}
{(1,1,3),(7,8,9),(15,16,17),(21,23,25)}
{(1,3,5),(7,8,9),(15,16,17),(21,23,25)}
{(1,2,3),(4,5,6),(7,8,9),(15,16,17),(21,23,25)}
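
For a quick sanity check of any candidate set, one can brute-force all 2^N dichotomies and test each for linear separability via an LP feasibility problem. The sketch below is our own illustration, not part of the homework; the helper names separable and shattered are hypothetical, and it assumes numpy and scipy are available.

    import itertools
    import numpy as np
    from scipy.optimize import linprog

    def separable(X, y):
        # feasibility LP: find w in R^4 with y_n * (w . [1, x_n]) >= 1 for every n
        Z = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])
        res = linprog(c=np.zeros(Z.shape[1]),
                      A_ub=-y[:, None] * Z, b_ub=-np.ones(len(X)),
                      bounds=[(None, None)] * Z.shape[1])
        return res.success

    def shattered(X):
        # a set is shattered iff every one of the 2^N labelings is separable
        return all(separable(X, np.array(ys))
                   for ys in itertools.product([-1, +1], repeat=len(X)))

    print(shattered([(7, 8, 9), (17, 18, 19), (27, 28, 29)]))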
What is the growth function of axis-aligned perceptrons in 2D for N ≥ 4? Those perceptrons are all perceptrons with w1w2 = 0; that is, their decision boundaries are vertical or horizontal lines on the 2D plane. Choose the correct answer; explain your answer. (A dichotomy-counting sketch appears after the choices.)
[a] 4N + 4
[b] 4N + 2
[c] 4N
[d] 4N − 2
[e] 4N − 4
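
For intuition on the growth function, the dichotomies realizable on a concrete point set can simply be counted: along each axis only finitely many thresholds matter, and each threshold can be paired with either sign. A small counting sketch (our own code, assuming numpy; the four example points are arbitrary and chosen to be in general position):

    import numpy as np

    def count_axis_aligned_dichotomies(points):
        pts = np.asarray(points, dtype=float)
        labelings = set()
        for j in range(2):                        # decision line perpendicular to axis j
            vals = np.sort(np.unique(pts[:, j]))
            # thresholds below, between, and above the coordinate values on axis j
            cuts = np.concatenate(([vals[0] - 1], (vals[:-1] + vals[1:]) / 2, [vals[-1] + 1]))
            for theta in cuts:
                base = np.where(pts[:, j] - theta > 0, 1, -1)
                labelings.add(tuple(base))        # s = +1
                labelings.add(tuple(-base))       # s = -1
        return len(labelings)

    print(count_axis_aligned_dichotomies([(0.1, 0.7), (0.4, 0.2), (0.6, 0.9), (0.9, 0.5)]))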

What is the VC dimension of positively-biased perceptrons in 2D? The positively-biased perceptrons are all perceptrons with w0 > 0. Choose the correct answer; explain your answer.
0
1
2
3
4
Ring Hypothesis Set
The “ring” hypothesis set in R3 contains hypotheses parameterized by two positive numbers a and b, where

h_{a,b}(x) = +1 if a ≤ x1² + x2² + x3² ≤ b, and −1 otherwise.

What is the growth function of the hypothesis set? Choose the correct answer; explain your answer.

[e] none of the other choices

Following the previous problem, what is the VC dimension of the ring hypothesis set? Choose the correct answer; explain your answer. (A counting sketch based on reducing each ring to an interval on x1² + x2² + x3² appears after the choices.)
1
2
3
6
none of the other choices
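
A useful observation for both problems: the ring condition only looks at the single number z = x1² + x2² + x3², so each h_{a,b} acts like the interval [a, b] on z. A brute-force counting sketch along those lines (our own code, assuming numpy and that no point sits exactly at the origin):

    import numpy as np

    def count_ring_dichotomies(points):
        z = np.sum(np.asarray(points, dtype=float) ** 2, axis=1)   # z = x1^2 + x2^2 + x3^2 per point
        zs = np.sort(np.unique(z))
        # candidate endpoints: below all z values, between consecutive ones, above all
        cuts = np.concatenate(([zs[0] / 2], (zs[:-1] + zs[1:]) / 2, [zs[-1] + 1]))
        labelings = set()
        for i, a in enumerate(cuts):
            for b in cuts[i:]:                                     # only a <= b gives a valid ring
                labelings.add(tuple(np.where((a <= z) & (z <= b), 1, -1)))
        return len(labelings)

    print(count_ring_dichotomies([(1, 0, 0), (0, 2, 0), (0, 0, 3), (1, 1, 1)]))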
Deviation from Optimal Hypothesis
In Lecture 7, the VC bound was stated from the perspective of g, the hypothesis picked by the learning algorithm. The bound itself actually quantifies the BAD probability for any hypothesis h in the hypothesis set. That is,

P[∃h ∈ H s.t. |Ein(h) − Eout(h)| > ε] ≤ 4 m_H(2N) exp(−ε²N/8).

Define the best-Ein hypothesis g = argmin_{h∈H} Ein(h)

and the best-Eout hypothesis (which is optimal but can only be obtained by a “cheating” algorithm)

g∗ = argmin_{h∈H} Eout(h).

Using the VC bound above, with probability more than 1 − δ, which of the following is an upper bound of Eout(g) − Eout(g∗)? Choose the correct answer; explain your answer.
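
One standard way to derive such a bound (a sketch, assuming only the bound form stated above): with probability more than 1 − δ the BAD event fails, so |Ein(h) − Eout(h)| ≤ ε holds simultaneously for every h ∈ H, where ε = sqrt((8/N) ln(4 m_H(2N)/δ)) comes from setting the right-hand side equal to δ. Then

    Eout(g) ≤ Ein(g) + ε       (the bound applied to g)
            ≤ Ein(g∗) + ε      (g minimizes Ein)
            ≤ Eout(g∗) + 2ε    (the bound applied to g∗),

so Eout(g) − Eout(g∗) ≤ 2ε = 2·sqrt((8/N) ln(4 m_H(2N)/δ)).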

The VC Dimension
For a finite hypothesis set H = {h1, h2, ..., hM}, where each hypothesis is a binary classifier from X to {−1,+1}, what is the largest possible value of dvc(H)? Choose the correct answer; explain your answer.
[a] M
[b] 2M
[c] M²
[d] ⌊log2 M⌋
[e] 2^M
A boolean function h: {−1,+1}^k → {−1,+1} is called symmetric if its value does not depend on the permutation of its inputs, i.e., its value depends only on the number of ones in the input. What is the VC dimension of the set of all symmetric boolean functions? Choose the correct answer; explain your answer.
k − 2
k − 1
k
k + 1
k + 2
How many of the following are necessary conditions for dvc(H) = d? Choose the correct answer; state which conditions correspond to your answer and explain them.
some set of d distinct inputs is shattered by H
some set of d distinct inputs is not shattered by H
any set of d distinct inputs is shattered by H
any set of d distinct inputs is not shattered by H
some set of d + 1 distinct inputs is shattered by H
some set of d + 1 distinct inputs is not shattered by H
any set of d + 1 distinct inputs is shattered by H
any set of d + 1 distinct inputs is not shattered by H
1
2
3
4
5
Which of the following hypothesis sets is of VC dimension ∞? Choose the correct answer; explain your answer.
the rectangle family: the infinite number of hypotheses where the boundary between the ±1 regions of each hypothesis looks like a rectangle (including axis-aligned ones and rotated ones) for x ∈ R2
the intersected-interval family: the infinite number of hypotheses where the positive region of each hypothesis can be represented as an intersection of any finite number of “positive intervals” for x ∈ R
the sine family: the infinite number of hypotheses {hα : hα(x) = sign(sin(α · x))} for x ∈ R
the scaling family: the infinite number of hypotheses for x ∈ Rd
none of the other choices
Noise and Error
Consider a binary classification problem where we sample (x,y) from a distribution P with y ∈ {−1,+1}. Now we define a distribution Pτ to be a “noisy” version of P. That is, to sample from Pτ, we first sample (x,y) from P and then flip y to −y with probability τ. Note that P0 = P. The distribution Pτ models a situation in which our training data is labeled by an unreliable human who mislabels with probability τ.
Define Eout(h,τ) to be the out-of-sample error of h with respect to Pτ. That is, Eout(h,τ) = E_{(x,y)∼Pτ} ⟦h(x) ≠ y⟧.

Which of the following relates Eout(h,τ) to Eout(h,0)? Choose the correct answer; explain your answer.
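
One way to relate the two quantities, under the flipping model just defined: a sample from Pτ is misclassified by h exactly when h errs on the clean label and the label is not flipped, or h is correct on the clean label and the label is flipped. Hence

    Eout(h,τ) = (1 − τ)·Eout(h,0) + τ·(1 − Eout(h,0)) = (1 − 2τ)·Eout(h,0) + τ.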

Consider x ∈ R3 and a target function f(x) = argmax_{i=1,2,3} xi, with ties broken, if any, by choosing the smallest i. Then, assume a process that generates (x,y) by a uniform P(x) within [0,1]3 and

y = f(x)                     with probability 0.7,
y = f(x) mod 3 + 1           with probability 0.1,
y = (f(x) + 1) mod 3 + 1     with probability 0.2.

The operation “a mod 3” returns the remainder when the integer a is divided by 3. When using the squared error, what is Eout(f) subject to the process above? Choose the correct answer; explain your answer. (Note: this is in some sense the “price of noise”.)

[a] 0.3
[b] 0.6
[c] 0.9
[d] 1.2
[e] 1.5
Following Problem 12, the squared error defines an ideal target function

f∗(x) = E_{y∼P(y|x)} [y],

as shown on page 11 of the Lecture 8 slides. Unlike the slides, however, we denote this function as f∗ to avoid confusion with the target function f used for generating the data. Define the squared difference between f and f∗ to be

∆(f, f∗) = E_{x∼P(x)} (f(x) − f∗(x))².

What is the value of ∆(f,f∗)? Choose the correct answer; explain your answer. (Note: this measures how much the original target function f was dragged by the noise in P(y|x) toward the “new” target function f∗.)

0.01
0.14
0.16
0.25
0.42
Decision Stump
On page 22 of the Lecture 5 slides (the Fun Time that you should play by yourself), we introduced the learning model of “positive and negative rays” (which is simply the one-dimensional perceptron). The model contains hypotheses of the form:

h_{s,θ}(x) = s · sign(x − θ),

where s ∈ {−1,+1} is the “direction” of the ray and θ ∈ R is the threshold. You can take sign(0) = −1 for simplicity. The model is frequently named the “decision stump” model and is one of the simplest learning models. As shown in class, the growth function of the model is 2N and the VC dimension is 2.

When using the decision stump model, given ε = 0.1 and δ = 0.1, among the five choices, what is the smallest N such that the BAD probability of the VC bound (as given in the beginning of Problem 6) is ≤ δ? Choose the correct answer; explain your answer. (A numeric check is sketched after the choices.)
6000
8000
10000
12000
14000
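
A quick numeric check of the five candidates (a sketch assuming the bound form stated at the beginning of Problem 6, ε = 0.1, δ = 0.1, and m_H(N) = 2N for decision stumps, so m_H(2N) = 4N; requires numpy):

    import numpy as np

    eps, delta = 0.1, 0.1
    for N in (6000, 8000, 10000, 12000, 14000):
        bad = 4 * (4 * N) * np.exp(-eps**2 * N / 8)   # 4 * m_H(2N) * exp(-eps^2 * N / 8)
        print(N, bad, bad <= delta)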
In fact, the decision stump model is one of the few models for which we can minimize Ein efficiently by enumerating all possible thresholds. In particular, for N examples, there are at most 2N dichotomies (see page 22 of the Lecture 5 slides), and thus at most 2N different Ein values. We can then easily choose the hypothesis that leads to the lowest Ein by the following decision stump learning algorithm.

(1)   sort all N examples xn into a sorted sequence x′1, x′2, ..., x′N such that x′1 ≤ x′2 ≤ ... ≤ x′N

(2)   for each θ ∈ {−1} ∪ {(x′i + x′i+1)/2 : 1 ≤ i ≤ N − 1 and x′i ≠ x′i+1} and s ∈ {−1,+1}, calculate Ein(h_{s,θ})

(3)   return the h_{s,θ} with the minimum Ein as g; if multiple hypotheses reach the minimum Ein, return the one with the smallest s + θ.

(Hint: CS-majored students are encouraged to think about whether the second step can be carried out efficiently, i.e. O(N), using dxxxxxc pxxxxxxxxxg instead of the naive implementation of O(N²).)
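
A direct implementation of the three steps might look as follows. This is our own sketch, using the naive O(N²) enumeration rather than the hinted O(N) scan, and assumes numpy; the function name train_decision_stump is hypothetical.

    import numpy as np

    def train_decision_stump(x, y):
        order = np.argsort(x)
        xs, ys = x[order], y[order]                       # step (1): sort the examples
        thetas = [-1.0] + [(xs[i] + xs[i + 1]) / 2        # step (2): candidate thresholds
                           for i in range(len(xs) - 1) if xs[i] != xs[i + 1]]
        best = None
        for s in (+1, -1):
            for theta in thetas:
                pred = np.where(xs - theta > 0, s, -s)    # h_{s,theta}(x) = s * sign(x - theta), sign(0) = -1
                ein = np.mean(pred != ys)
                key = (ein, s + theta)                    # step (3): tie-break by the smallest s + theta
                if best is None or key < best[0]:
                    best = (key, s, theta)
        (ein, _), s, theta = best
        return s, theta, ein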
Next, you are asked to implement such an algorithm and run your program on an artificial data set. We shall start by generating (x,y) with the following procedure. We will take the target function f(x) = sign(x):

Generate x by a uniform distribution in [−1,+1].
Generate y from x by y = f(x) and then flip y to −y with probability τ, independently for each example.
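
A minimal data generator following the two steps above (our own sketch, assuming numpy; the helper name generate_data is hypothetical):

    import numpy as np

    def generate_data(N, tau, rng):
        x = rng.uniform(-1.0, 1.0, size=N)        # x ~ Uniform[-1, +1]
        y = np.where(x > 0, 1, -1)                # y = f(x) = sign(x), taking sign(0) = -1
        flip = rng.uniform(size=N) < tau          # flip each label independently with probability tau
        return x, np.where(flip, -y, y)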
For θ ∈ [−1,+1], what is Eout(h_{+1,θ}, 0), where Eout(h,τ) is defined in Problem 11? Choose the correct answer; explain your answer.
(*) For τ = 0, which means that your data is noiseless, generate a data set of size 2 by the procedure above and run the decision stump algorithm on the data set to get g. Repeat the experiment 10000 times, each with a different data set. What is the mean of Eout(g,τ) − Ein(g) within the 10000 results? Choose the closest value. (By extending the results in Problem 11 and Problem 15, you can actually compute any Eout(h_{s,θ},τ). But if you do not trust your math derivation, you can get a very accurate estimate of Eout(g) by evaluating g on a separate test data set of size 100000, as guaranteed by Hoeffding’s inequality.) An experiment sketch appears after the choices.
0.00
0.02
0.05
0.30
0.40
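
One way to run this experiment, reusing the generate_data and train_decision_stump sketches above and estimating Eout(g,τ) on a large separate test set (the exact Eout from Problems 11 and 15 could be used instead):

    import numpy as np

    rng = np.random.default_rng(1126)             # the seed is arbitrary
    tau, N, diffs = 0.0, 2, []
    for _ in range(10000):
        x, y = generate_data(N, tau, rng)
        s, theta, ein = train_decision_stump(x, y)
        xt, yt = generate_data(100000, tau, rng)  # test set for estimating Eout(g, tau)
        eout = np.mean(np.where(xt - theta > 0, s, -s) != yt)
        diffs.append(eout - ein)
    print(np.mean(diffs))

The later problems only change τ and the training-set size N.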
(*) For τ = 0, generate a data set of size 20 by the procedure above and run the decision stump algorithm on the data set to get g. Repeat the experiment 10000 times, each with a different data set. What is the mean of Eout(g,τ) − Ein(g) within the 10000 results? Choose the closest value.
0.00
0.02
0.05
0.30
0.40
(*) For τ = 0.1, generate a data set of size 2 by the procedure above and run the decision stump algorithm on the data set to get g. Repeat the experiment 10000 times, each with a different data set. What is the mean of Eout(g,τ) − Ein(g) within the 10000 results?
0.00
0.02
0.05
0.30
0.40
(*) For τ = 0.1, generate a data set of size 20 by the procedure above and run the decision stump algorithm on the data set to get g. Repeat the experiment 10000 times, each with a different data set. What is the mean of Eout(g,τ) − Ein(g) within the 10000 results? Choose the closest value.
0.00
0.02
0.05
0.30
0.40
(*) For τ = 0.1, generate a data set of size 200 by the procedure above and run the decision stump algorithm on the data set to get g. Repeat the experiment 10000 times, each with a different data set. What is the mean of Eout(g,τ) − Ein(g) within the 10000 results? Choose the closest value.
0.00
0.02
0.05
0.30
0.40
