(25 points) Union of Intervals. In this question, we will study the hypothesis class of a finite union of disjoint intervals, and the properties of the ERM algorithm for this class.
To review, let the sample space be X = [0,1] and assume we study a binary classification problem, i.e. Y = {0,1}. We will try to learn using a hypothesis class that consists of k intervals. More explicitly, let I = {[l1,u1],...,[lk,uk]} be k disjoint intervals, such that 0 ≤ l1 ≤ u1 ≤ l2 ≤ u2 ≤ ... ≤ uk ≤ 1. For each such set of k disjoint intervals, define the corresponding hypothesis as
hI(x) = 1 if x ∈ [l1,u1] ∪ ... ∪ [lk,uk], and hI(x) = 0 otherwise.
Finally, define Hk as the hypothesis class that consists of all hypotheses that correspond to k disjoint intervals:
Hk = {hI|I = {[l1,u1],...,[lk,uk]}, 0 ≤ l1 ≤ u1 ≤ l2 ≤ u2 ≤ ... ≤ uk ≤ 1}
We note that Hk ⊆ Hk+1, since we can always take one of the intervals to be of length 0 by setting its endpoints to be equal. We are given a sample of size m: ⟨x1,y1⟩,...,⟨xm,ym⟩. Assume that the points are sorted, so that 0 ≤ x1 < x2 < ... < xm ≤ 1.
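For concreteness, evaluating hI on a point can be sketched as follows (the function name `predict` is ours, not part of intervals.py):

```python
# A minimal sketch of h_I(x): the hypothesis predicts 1 exactly when x
# falls inside one of the k disjoint intervals, and 0 otherwise.
def predict(intervals, x):
    """Return h_I(x), where intervals is a list of (l, u) tuples."""
    return 1 if any(l <= x <= u for (l, u) in intervals) else 0
```

For example, with I = {[0, 0.2], [0.4, 0.6]}, the point 0.5 is labeled 1 while 0.3 is labeled 0.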
Explanation on intervals.py:
The file intervals.py includes a function that implements an ERM algorithm for Hk. Given a sorted list xs = [x1,...,xm], the respective labels ys = [y1,...,ym], and k, the function find_best_interval returns a list of up to k intervals together with their error count on the given sample. These intervals achieve the smallest empirical error count possible among all choices of at most k intervals.
Note that in sections (d)-(f) you will need to use this function for large values of m. Execution in these cases can take time (more than 10 minutes per experiment), so plan ahead.
(a) Assume that the true distribution P[x,y] = P[y|x] · P[x] is: x is distributed uniformly on the interval [0,1], and
P[y = 1|x] = 0.8 if x ∈ [0,0.2] ∪ [0.4,0.6] ∪ [0.8,1], and P[y = 1|x] = 0.1 if x ∈ [0.2,0.4] ∪ [0.6,0.8],
and P[y = 0|x] = 1 − P[y = 1|x].
Write a function that draws m pairs of (x,y) according to the distribution P. Use it to draw a sample of size m = 100 and create a plot:
i. Plot the points and their labels (set the y-axis range to [−0.1, 1.1] for clarity of presentation).
ii. Mark the lines x = 0.2,0.4,0.6,0.8 clearly on the plot.
iii. Run the find_best_interval function on your sample with k = 3, and plot the intervals clearly.
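The sampling step above can be sketched as follows (`sample_from_D` is a hypothetical name; only the distribution itself comes from the question):

```python
import numpy as np

def sample_from_D(m, rng=None):
    """Draw m labeled points from the distribution P described above.
    Returns xs sorted ascending, as find_best_interval expects."""
    rng = rng or np.random.default_rng()
    xs = np.sort(rng.uniform(0.0, 1.0, m))
    # P[y=1|x] = 0.8 on [0,0.2] ∪ [0.4,0.6] ∪ [0.8,1], and 0.1 elsewhere.
    p = np.where((xs <= 0.2) | ((0.4 <= xs) & (xs <= 0.6)) | (xs >= 0.8),
                 0.8, 0.1)
    ys = rng.binomial(1, p)
    return xs, ys
```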
(b) Note that here, we know the true distribution P, so for every given hypothesis h ∈ Hk, we can calculate error(h) precisely. What is the hypothesis with the smallest error?
(c) Write a function that, given a list of intervals, calculates the true error of the respective hypothesis. Then, for k = 3 and m = 10,15,20,...,100, perform the following experiment T = 100 times: (i) draw a sample of size m and run the ERM algorithm on it; (ii) calculate the empirical error of the returned hypothesis; (iii) calculate the true error of the returned hypothesis. Plot the empirical and true errors, averaged over the T runs, as a function of m. Discuss the results: do the empirical and true errors decrease or increase with m? Why?
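Since P[x] is uniform and P[y|x] is piecewise constant, the true error of any hI can be computed in closed form from interval overlaps. A sketch, assuming `intervals` is a list of disjoint (l, u) pairs (`true_error` and `overlap` are our names):

```python
def true_error(intervals):
    """True (expected 0-1) error of h_I under the distribution P above."""
    high = [(0.0, 0.2), (0.4, 0.6), (0.8, 1.0)]   # P[y=1|x] = 0.8
    low  = [(0.2, 0.4), (0.6, 0.8)]               # P[y=1|x] = 0.1

    def overlap(a, b):
        # Total length of the intersection of two lists of intervals.
        return sum(max(0.0, min(u, v) - max(l, w))
                   for (l, u) in a for (w, v) in b)

    len_pred1_high = overlap(intervals, high)
    len_pred1_low = overlap(intervals, low)
    # Where h predicts 1, the error probability is P[y=0|x];
    # where h predicts 0, it is P[y=1|x].
    err = 0.2 * len_pred1_high + 0.9 * len_pred1_low
    err += 0.8 * (0.6 - len_pred1_high) + 0.1 * (0.4 - len_pred1_low)
    return err
```

As a sanity check, the hypothesis that predicts 1 exactly on the high-probability regions has true error 0.2·0.6 + 0.1·0.4 = 0.16, and the all-zeros hypothesis has true error 0.8·0.6 + 0.1·0.4 = 0.52.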
(d) Draw a data set of m = 1500 samples. Find the best ERM hypothesis for k = 1,2,...,20, and plot the empirical and true errors as a function of k. How does each error behave? Define k∗ to be the k with the smallest empirical error for ERM. Does this mean the hypothesis with k∗ intervals is a good choice?
(e) Now we will use the principle of structural risk minimization (SRM) to search for a k that gives good test error. Let δ = 0.1:
• Use your results from question 3 in the theoretical part and the VC confidence bound to construct a penalty function.
• Draw a data set of m = 1500 samples and run the experiment in (d) again, but now plot two additional curves as a function of k: (1) the penalty of the best ERM hypothesis, and (2) the sum of the penalty and the empirical error.
• What is the best value of k in each case? Is it better than the one you chose in (d)?
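One possible shape for the penalty function, assuming VCdim(Hk) = 2k and one standard form of the VC confidence bound; the exact bound and constants should come from your own answer to question 3 in the theoretical part:

```python
import numpy as np

def penalty(k, m, delta=0.1):
    """A sketch of a VC-based penalty; assumes VCdim(H_k) = 2k and one
    common form of the VC confidence bound (substitute your own)."""
    d = 2 * k
    return np.sqrt((8.0 / m) * (d * np.log(2 * np.e * m / d)
                                + np.log(4.0 / delta)))
```

Whatever bound you derive, the penalty should grow with k (richer classes are penalized more) and shrink with m.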
(f) Here we will use holdout validation to search for a k that gives good test error. Draw a data set of m = 1500 samples and hold out 20% of it for validation. Choose the best hypothesis based on 3 experiments. Discuss how close this gets you to finding the hypothesis with the optimal true error.
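The validation error used to compare the candidate hypotheses returned by find_best_interval for each k can be sketched as follows (`holdout_error` is a hypothetical helper; the intervals come from running ERM on the 80% training split):

```python
def holdout_error(intervals, xs, ys):
    """Fraction of held-out points misclassified by h_I."""
    def predict(x):
        return 1 if any(l <= x <= u for (l, u) in intervals) else 0
    return sum(predict(x) != y for x, y in zip(xs, ys)) / len(xs)
```

The k minimizing this holdout error is then the selected model.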