1. Singletons (Section 3.5, Ex. 2 in the course textbook). Let $X$ be a discrete domain, and let $H_{\mathrm{Singleton}} = \{h_z : z \in X\} \cup \{h_-\}$, where for each $z \in X$, $h_z$ is the function defined by $h_z(x) = 1$ if $x = z$ and $h_z(x) = 0$ if $x \neq z$; $h_-$ is simply the all-negative hypothesis, namely $\forall x \in X,\ h_-(x) = 0$.
The realizability assumption here implies that the true hypothesis $f$ labels all examples in the domain negatively, except possibly one.
Describe an algorithm that implements the ERM rule for learning HSingleton in the realizable setup.
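To make the class concrete, the two kinds of hypotheses can be written as simple predicates. This is an illustrative sketch only; the names `make_h_z` and `h_minus` are ours, not part of the assignment.

```python
# Illustrative: the hypotheses in H_Singleton as Python functions.

def make_h_z(z):
    """Return the singleton hypothesis h_z, which labels only z positively."""
    return lambda x: 1 if x == z else 0

def h_minus(x):
    """The all-negative hypothesis: labels every example negatively."""
    return 0
```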
2. PAC in Expectation. Consider learning in the realizable case. We say a hypothesis class $H$ is PAC learnable in expectation if there exist a learning algorithm $A$ and a function $N : (0,1) \to \mathbb{N}$ such that for every $a \in (0,1)$ and for any distribution $P$, given a sample set $S$ with $|S| \ge N(a)$, it holds that
$$\mathbb{E}_S\left[e_P(A(S))\right] \le a.$$
Show that H is PAC learnable if and only if H is PAC learnable in expectation (Hint: Use Markov’s inequality and refer to derivations between equations 3.13-3.14 in the lecture scribes about VC).
3. Union Of Intervals. Determine the VC-dimension of the subsets of the real line formed by the union of k intervals (see question 1 of the programming assignment for a formal definition of H).
4. Right-angle Triangles. Determine the VC-dimension of the hypothesis class $H$ of axis-aligned right-angle triangles in the plane, with the right angle in the lower left corner.
5. Structural Risk Minimization. Let $H$ be a countable hypothesis class, that is, $H$ can be written as $H = \bigcup_{i \in \mathbb{N}} \{h_i\}$. Let $w : H \to [0,1]$ be a function such that $\sum_{h \in H} w(h) \le 1$. We refer to $w$ as a weight function over the hypotheses, which reflects the prior for each hypothesis.
(a) Show that with probability $1 - \delta$ over the choice of $S \sim P$ ($|S| = n$), for all $h \in H$ simultaneously:
$$\left|e_P(h) - e_S(h)\right| \le \sqrt{\frac{\ln\left(2/(w(h)\delta)\right)}{2n}}.$$
Hint: use the uniform convergence property for each hypothesis class $\{h_i\}$ of size 1.
(b) Denote $h^* = \arg\min_{h \in H}\left[e_S(h) + \sqrt{\frac{\ln\left(2/(w(h)\delta)\right)}{2n}}\right]$. Show that with probability $1 - \delta$:
$$e_P(h^*) \le e_P(h_{\min}) + 2\sqrt{\frac{\ln\left(2/(w(h_{\min})\delta)\right)}{2n}},$$
where $h_{\min} = \arg\min_{h \in H} e_P(h)$.
This implies that when the hypothesis with optimal error has high weight, it will be learned from few samples.
6. (8 points) Loss Minimization. Consider binary classification with the following loss function:
$$\Delta_b(y,\hat{y}) = \begin{cases} 0 & y = \hat{y} \\ 0.5 & y = 1,\ \hat{y} = 0 \\ 1 & y = 0,\ \hat{y} = 1 \end{cases}$$
Find the optimal classifier $h^* = \arg\min_h \mathbb{E}\left[\Delta_b(Y, h(X))\right]$.
Programming Assignment
1. Union Of Intervals. In this question, we will study the hypothesis class of a finite union of disjoint intervals, and the properties of the ERM algorithm for this class.
To review, let the sample space be $X = [0,1]$ and assume we study a binary classification problem, i.e. $Y = \{0,1\}$. We will try to learn using a hypothesis class that consists of $k$ intervals. More explicitly, let $I = \{[l_1,u_1],\dots,[l_k,u_k]\}$ be $k$ disjoint intervals such that $0 \le l_1 \le u_1 \le l_2 \le u_2 \le \dots \le u_k \le 1$. For each such $k$ disjoint intervals, define the corresponding hypothesis as
$$h_I(x) = \begin{cases} 1 & \text{if } x \in [l_1,u_1] \cup \dots \cup [l_k,u_k] \\ 0 & \text{otherwise} \end{cases}$$
Finally, define $H_k$ as the hypothesis class that consists of all hypotheses that correspond to $k$ disjoint intervals:
$$H_k = \left\{h_I \,\middle|\, I = \{[l_1,u_1],\dots,[l_k,u_k]\},\ 0 \le l_1 \le u_1 \le l_2 \le u_2 \le \dots \le u_k \le 1\right\}$$
We note that $H_k \subseteq H_{k+1}$, since we can always take one of the intervals to be of length 0 by setting its endpoints to be equal. We are given a sample of size $m$: $\langle x_1,y_1\rangle,\dots,\langle x_m,y_m\rangle$. Assume that the points are sorted, so that $0 \le x_1 < x_2 < \dots < x_m \le 1$.
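As a sanity check on this definition, evaluating $h_I$ at a point is just a membership test over the interval list; a minimal sketch (the helper name `h_I` is ours):

```python
def h_I(intervals, x):
    """Evaluate h_I(x): 1 if x lies in any of the given intervals, else 0.

    `intervals` is a list of (l, u) pairs with 0 <= l1 <= u1 <= ... <= uk <= 1.
    """
    return 1 if any(l <= x <= u for (l, u) in intervals) else 0
```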
Submission Guidelines:
• Download the files skeleton.py and intervals.py from Moodle. You should implement only the missing code in skeleton.py, as specified in the following questions. In every method description, you will find specific details on its input and return values.
• Your code should be written in Python 3.
• Make sure to comment out or remove any code that halts execution, such as matplotlib popup windows.
• Your submission should include exactly two files: assignment2.py, intervals.py.
Explanation of intervals.py:
The file intervals.py includes a function that implements an ERM algorithm for $H_k$. Given a sorted list xs = [x1,...,xm], the respective labels ys = [y1,...,ym], and k, the function find_best_interval returns a list of up to k intervals and their error count on the given sample. These intervals have the smallest empirical error count possible among all choices of k intervals or fewer.
Note that in sections (d)-(f) you will need to use this function for large values of m. Execution in these cases could take time (more than 10 minutes per experiment), so plan ahead.
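Assuming the signature matches the description above, a typical call looks like this (the sample values here are made up for illustration):

```python
from intervals import find_best_interval

xs = [0.1, 0.25, 0.5, 0.7, 0.9]   # must be sorted in increasing order
ys = [1, 0, 1, 0, 1]              # the matching 0/1 labels
best_intervals, error_count = find_best_interval(xs, ys, 2)
# best_intervals: a list of up to k (l, u) pairs;
# error_count: the empirical error count of the returned hypothesis on the sample.
```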
(a) Assume that the true distribution $P[x,y] = P[y|x] \cdot P[x]$ is as follows: $x$ is distributed uniformly on the interval $[0,1]$, and
$$P[y = 1 \mid x] = \begin{cases} 0.8 & \text{if } x \in [0,0.2] \cup [0.4,0.6] \cup [0.8,1] \\ 0.1 & \text{if } x \in [0.2,0.4] \cup [0.6,0.8] \end{cases}$$
and $P[y = 0 \mid x] = 1 - P[y = 1 \mid x]$.
Write a function that draws m pairs of (x, y) according to the distribution P (a possible sketch follows this list). Use it to draw a sample of size m = 100 and create a plot:
i. Plot the points and their labels (keep the y-axis in the range [−0.1, 1.1] for clarity of presentation).
ii. Mark the lines x = 0.2,0.4,0.6,0.8 clearly on the plot.
iii. Run the find_best_interval function on your sample with k = 3, and plot the intervals clearly.
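A possible shape for the sampling routine, assuming numpy (the name `sample_from_D` is ours):

```python
import numpy as np

def sample_from_D(m):
    """Draw m pairs (x, y) from the distribution P described in part (a).

    Returns xs sorted in increasing order (as find_best_interval expects)
    together with the matching labels ys.
    """
    xs = np.sort(np.random.uniform(0.0, 1.0, size=m))
    # P[y = 1 | x] is 0.8 on [0, 0.2] ∪ [0.4, 0.6] ∪ [0.8, 1], and 0.1 elsewhere.
    p1 = np.where((xs <= 0.2) | ((xs >= 0.4) & (xs <= 0.6)) | (xs >= 0.8),
                  0.8, 0.1)
    ys = (np.random.uniform(size=m) < p1).astype(int)
    return xs, ys
```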
(b) Note that here we know the true distribution $P$, so for every given hypothesis $h \in H_k$ we can calculate the true error $e_P(h)$ precisely. What is the hypothesis with the smallest error?
(c) Write a function that, given a list of intervals, calculates the true error of the corresponding hypothesis (a sketch appears below). Then, for k = 3 and m = 10, 15, 20, ..., 100, perform the following experiment T = 100 times: (i) draw a sample of size m and run the ERM algorithm on it; (ii) calculate the empirical error of the returned hypothesis; (iii) calculate the true error of the returned hypothesis. Plot the average empirical and true errors, averaged across the T runs, as a function of m. Discuss the results: do the empirical and true errors decrease or increase with m? Why?
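Because P is piecewise constant, the true error of $h_I$ can be computed in closed form rather than estimated: integrate the pointwise error over $[0,1]$, splitting at the breakpoints 0.2, 0.4, 0.6, 0.8. A hedged sketch (the name `true_error` is ours; the intervals are assumed disjoint, as the class requires):

```python
def true_error(intervals):
    """Compute e_P(h_I) under the distribution P of part (a).

    Where h_I predicts 1 the pointwise error is P[y = 0 | x]; where it
    predicts 0 it is P[y = 1 | x]. Both are piecewise constant, so the
    integral over [0, 1] reduces to a finite sum.
    """
    # Segments on which P[y = 1 | x] is constant, as (start, end, P[y=1|x]).
    segments = [(0.0, 0.2, 0.8), (0.2, 0.4, 0.1), (0.4, 0.6, 0.8),
                (0.6, 0.8, 0.1), (0.8, 1.0, 0.8)]
    error = 0.0
    for a, b, p1 in segments:
        # Total length of [a, b] covered by the (disjoint) hypothesis intervals.
        covered = sum(max(0.0, min(u, b) - max(l, a)) for (l, u) in intervals)
        error += covered * (1.0 - p1)        # h_I predicts 1 but y = 0
        error += ((b - a) - covered) * p1    # h_I predicts 0 but y = 1
    return error
```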
(d) (7 points) Draw a data set of m = 1500 samples. Find the best ERM hypothesis for k = 1, 2, ..., 20, and plot the empirical and true errors as a function of k. How does the error behave? Define k∗ to be the k with the smallest empirical error for ERM. Does this mean that the hypothesis with k∗ intervals is a good choice?
(e) (7 points) Now we will use the principle of structural risk minimization (SRM) to search for a k that gives good test error. Let δ = 0.1:
• Use your results from question 3 in the theoretical part and the VC confidence bound to construct a penalty function (an illustrative sketch appears after this list).
• Draw a data set of m = 1500 samples, run the experiment in (d) again, but now plot two additional lines as a function of k: 1) the penalty for the best ERM hypothesis and 2) the sum of penalty and empirical error.
• What is the best value of k in each case? Is it better than the one you chose in (d)?
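As an illustration only: a penalty of the usual VC confidence-bound shape could look like the sketch below. Both the VC-dimension used and the exact constants are assumptions here; substitute your own answer to question 3 and the bound from the lecture notes.

```python
import numpy as np

def penalty(k, m, delta=0.1):
    """Illustrative VC confidence-bound penalty for H_k (form is an assumption)."""
    d = 2 * k  # assumed VC-dimension of H_k; verify against your answer to question 3
    return 2.0 * np.sqrt((d + np.log(2.0 / delta)) / m)
```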
(f) (7 points) Here we will use holdout validation to search for a k that gives good test error. Draw a data set of m = 1500 samples and use 20% of it as a holdout set. Choose the best hypothesis based on 3 experiments (one possible organization of a single experiment is sketched below). Discuss how close this gets you to finding the hypothesis with the optimal true error.
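The sketch below organizes a single holdout experiment; repeat it 3 times as instructed and aggregate the results. It assumes `find_best_interval` from intervals.py plus the `sample_from_D` and `h_I` helpers sketched earlier; all other names are ours.

```python
import numpy as np
from intervals import find_best_interval

def choose_k_by_holdout(m=1500, k_max=20):
    """Fit ERM on a random 80% of the sample for each k and pick the k
    with the smallest error on the held-out 20%."""
    xs, ys = sample_from_D(m)
    idx = np.random.permutation(m)
    train, hold = idx[:int(0.8 * m)], idx[int(0.8 * m):]
    order = np.argsort(xs[train])              # find_best_interval expects sorted xs
    train_x, train_y = xs[train][order], ys[train][order]
    best_k, best_err = None, float('inf')
    for k in range(1, k_max + 1):
        intervals, _ = find_best_interval(list(train_x), list(train_y), k)
        err = float(np.mean([h_I(intervals, x) != y
                             for x, y in zip(xs[hold], ys[hold])]))
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err
```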