1. VC-dimension of Neural Networks - Upper bound. We will now finish what we started in the previous recitation and assignment. Let C be the class of hypotheses implementable by neural networks (NN) with L layers (including the output layer, excluding the input layer), where each layer has exactly d nodes (except the output layer, which has a single node) and all nodes use the sign activation function.
Denote by H the family of linear separators in $\mathbb{R}^d$. We have seen that any single node i in the t-th layer implements a function which is a member of H. Seen as a whole, each layer implements a function from $\mathbb{R}^d$ to $\mathbb{R}^d$:
$$f^{(t+1)}(z_t) := z_{t+1} = h\!\left(W^{(t+1)} z_t + b^{(t)}\right)$$
where h operates element-wise. Denote by F the class of such functions.
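To make the layer map concrete, here is a minimal NumPy sketch of one such layer with the sign activation; the names layer_forward, W, and b are illustrative only and are not part of the assignment.

import numpy as np

def layer_forward(W, b, z):
    """One layer z_{t+1} = h(W z_t + b), with h = sign applied element-wise.

    W is a d x d weight matrix, b a length-d bias vector, and z the previous
    layer's output. Note that np.sign maps 0 to 0, a corner case the sign
    activation in the exercise leaves unspecified.
    """
    return np.sign(W @ z + b)

# Tiny example: a layer with d = 3 nodes acting on a point in R^3.
rng = np.random.default_rng(0)
d = 3
W = rng.standard_normal((d, d))
b = rng.standard_normal(d)
z0 = rng.standard_normal(d)
print(layer_forward(W, b, z0))  # entries in {-1, 0, +1}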
(a) Show that $\Pi_F(n) \le \left(\Pi_H(n)\right)^d$ for every $n \ge d + 1$.
(b) Express C in terms of H. Give a bound on the growth function of C, for n ≥ d + 1.
(c) Let N be the number of parameters in a multilayer NN as defined above. Express N in terms of d and L (the number of layers).
(d) Show that $2^n \le (en)^N \;\Rightarrow\; n \le 2N \log_2(eN)$.
(e) We are finally in a position to derive a bound for the VC-dimension. Show that $\Pi_C(n) \le (en)^N$, and use this to show that $\mathrm{VCdim}(C) \le 2N \log_2(eN)$.
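As a quick sanity check on the scale of this bound, the quantity $2N\log_2(eN)$ from part (e) can be evaluated directly; the value of N below is an arbitrary example, not the answer to part (c).

import math

def vc_upper_bound(N):
    """Evaluate the upper bound 2 * N * log2(e * N) for a given parameter count N."""
    return 2 * N * math.log2(math.e * N)

print(vc_upper_bound(100))  # ~1617.3 for a network with N = 100 parameters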
2. Suboptimality of ID3. Solve exercise 2 in chapter 18 of the book Understanding Machine Learning: From Theory to Algorithms.
3. (15 points, 5 points for each section) AdaBoost. Let $x_1,\ldots,x_m \in \mathbb{R}^d$ and $y_1,\ldots,y_m \in \{-1,1\}$ their labels. We run the AdaBoost algorithm as given in the recitation, and we are in iteration t. Assume that $0 < \epsilon_t < \frac{1}{2}$, where $\epsilon_t$ is the weighted error of the hypothesis $h_t$ chosen at iteration t.
(a) (Do not submit) Show that . Use the latter equalities to show that
(b) Show that the error of the current hypothesis $h_t$ with respect to the new distribution $D_{t+1}$ is exactly $\frac{1}{2}$, that is:
$$\sum_{i=1}^{m} D_{t+1}(i)\,\mathbb{1}[h_t(x_i) \neq y_i] = \frac{1}{2}.$$
(c) Show that AdaBoost will not pick the same hypothesis twice consecutively; that is, $h_{t+1} \neq h_t$.
(d) Show that setting the weights to be $w_t = \frac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$ brings $Z_t$ to a minimum.
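For reference, a single AdaBoost iteration as used in this question can be sketched as follows. The notation ($\epsilon_t$ for the weighted error, $w_t$ for the hypothesis weight, $Z_t$ for the normalizer) is assumed to match the recitation; the function name is illustrative.

import numpy as np

def adaboost_step(D, preds, y):
    """One AdaBoost distribution update under the standard formulation.

    D     -- current distribution D_t over the m examples (nonnegative, sums to 1)
    preds -- predictions h_t(x_i) in {-1, +1} of the chosen weak hypothesis
    y     -- true labels in {-1, +1}

    Returns the hypothesis weight w_t, the normalizer Z_t, and D_{t+1}.
    """
    eps = np.sum(D * (preds != y))             # weighted error of h_t under D_t
    w = 0.5 * np.log((1 - eps) / eps)          # the weight from part (d)
    unnormalized = D * np.exp(-w * y * preds)  # exponential re-weighting
    Z = unnormalized.sum()                     # normalizer Z_t
    return w, Z, unnormalized / Z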
4. (10 points, 5 points for each section) Sufficient Condition for Weak Learnability. Let $S = \{(x_1,y_1),\ldots,(x_n,y_n)\}$ be a training set and let H be a hypothesis class. Assume that there exist $\gamma > 0$, hypotheses $h_1,\ldots,h_k \in H$ and coefficients $a_1,\ldots,a_k \ge 0$ with $\sum_{j=1}^{k} a_j = 1$ for which the following holds:
$$y_i \sum_{j=1}^{k} a_j h_j(x_i) \ge \gamma \qquad (1)$$
for all $(x_i,y_i) \in S$.
(a) Show that for any distribution D over S there exists $1 \le j \le k$ such that
$$\Pr_{(x_i,y_i)\sim D}\left[h_j(x_i) \neq y_i\right] \le \frac{1}{2} - \frac{\gamma}{2}.$$
(Hint: Take expectation of both sides of inequality (1) with respect to D.)
(b) Let $S = \{(x_1,y_1),\ldots,(x_n,y_n)\} \subseteq \mathbb{R}^d \times \{-1,1\}$ be a training set that is realized by a d-dimensional hyper-rectangle classifier, i.e., there exists a d-dimensional hyper-rectangle $[a_1,b_1] \times \cdots \times [a_d,b_d]$ that classifies the data correctly. Let H be the class of decision stumps of the form
$$h_{j,\theta}(x) = \begin{cases} 1 & x_j \le \theta \\ -1 & x_j > \theta, \end{cases}$$
for $1 \le j \le d$ and $\theta \in \mathbb{R} \cup \{\infty,-\infty\}$ (for $\theta \in \{\infty,-\infty\}$ we get constant hypotheses which always predict 1 or always predict $-1$). Show that there exist $\gamma > 0$, $k > 0$, hypotheses $h_1,\ldots,h_k \in H$ and $a_1,\ldots,a_k \ge 0$ with $\sum_{j=1}^{k} a_j = 1$, such that the condition in inequality
(1) holds for the training set S and hypothesis class H.
(Hint: Set k = 4d − 1 and let 2d − 1 of the hypotheses be constant.)
5. (15 points, 7.5 points for each section) Linear regression with dependent variables. Consider the regression problem $\min_{w} \|Xw - y\|_2^2$, where X is an $n \times d$ data matrix, y is a column vector of size n, and w is a column vector of coefficients of size d. As we discussed in the lecture, if there are dependent variables, there are infinitely many solutions that achieve this minimum. One sensible criterion for choosing among all the possible solutions is to prefer a solution with minimal $\ell_2$ norm. That is, we search for a w that solves the following problem:
$$\begin{aligned} \arg\min_{w}\ & \|w\|_2 \\ \text{s.t.}\ & Xw = y \end{aligned}$$
Assume that $d > n$ and that the matrix X has rank n (note that in principle the rank can be smaller than n), and denote by $w^\star$ the optimum of the above problem.
(a) (No need to submit) Convince yourself that there exists a w such that Xw = y. Namely, the above problem has at least one feasible solution.
(b) Show that the optimal w can be written as a linear combination of the data points. Namely, there exists a vector $\alpha \in \mathbb{R}^n$ such that the solution is given by $w^\star = X^T\alpha$.
(c) Show that you can calculate $x^T w^\star$ for every $x \in \mathbb{R}^d$ using only dot products between vectors in $\mathbb{R}^d$ (hint: express the solution using the kernel matrix $K_S = XX^T$).
Note that the above implies that you can use the “kernel trick” in this case. Namely, you can also work with features φ(x) as long as you can calculate the corresponding kernel.
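A minimal numeric sketch of the observation above, assuming the representation $w^\star = X^T\alpha$ from part (b): predictions can be formed from inner products alone via the kernel matrix $K_S$. The function name kernel_predict is illustrative.

import numpy as np

def kernel_predict(X, y, x_new):
    """Compute x_new^T w* using only inner products, assuming X has full row rank.

    K_S = X X^T is the n x n kernel (Gram) matrix; alpha solves K_S alpha = y,
    so w* = X^T alpha satisfies X w* = y, and
    x_new^T w* = sum_i alpha_i <x_new, x_i>.
    """
    K = X @ X.T                    # kernel matrix K_S of the training points
    alpha = np.linalg.solve(K, y)  # coefficients of the data-point expansion
    return (X @ x_new) @ alpha     # combine <x_new, x_i> with alpha

# Sanity check on a tiny problem with d > n:
rng = np.random.default_rng(1)
n, d = 3, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w_star = X.T @ np.linalg.solve(X @ X.T, y)  # the same solution written explicitly
x_new = rng.standard_normal(d)
print(np.isclose(kernel_predict(X, y, x_new), x_new @ w_star))  # True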
Programming Assignment
Submission guidelines:
• Download the supplied files from Moodle (two Python files and one tar.gz file). Details on every file will be given in the exercises. You need to update the code only in the skeleton files, i.e., the files that have the prefix "skeleton". Written solutions, plots and any other non-code parts should be included in the written solution submission.
• Your code should be written in Python 3.
• Make sure to comment out or remove any code which halts code execution, such as matplotlib popup windows.
• Your code submission should include these files: adaboost.py, process_data.py
1. (30 points) AdaBoost. In this exercise, we will implement AdaBoost and see how boosting can be applied to real-world problems. We will focus on binary sentiment analysis, the task of classifying the polarity of a given text into two classes - positive or negative. We will use movie reviews from IMDB as our data.
Download the provided files from Moodle and put them in the same directory:
• review_polarity.tar.gz - a sentiment analysis dataset of movie reviews from IMDB.[1] Extract its content in the same directory (with any of zip, 7z, WinRAR, etc.), so you will have a folder called review_polarity.
• process_data.py - code for loading and preprocessing the data.
• skeleton_adaboost.py - this is the file you will work on; rename it to adaboost.py before submitting.
The main function in adaboost.py calls the parse_data method, which processes the data and represents every review as a 5000-dimensional vector x. The entries of x are counts, in the review that x represents, of the most common words in the dataset (excluding stopwords like "a" and "and"). Concretely, let $w_1,\ldots,w_{5000}$ be the most common words in the data; given a review $r_i$, we represent it as a vector $x_i \in \mathbb{N}^{5000}$ where $x_{i,j}$ is the number of times the word $w_j$ appears in $r_i$. The method parse_data returns training data, test data and a vocabulary. The vocabulary is a dictionary that maps each index in the data to the word it represents (i.e., it maps $j \to w_j$).
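To make the representation concrete, here is a small illustration of turning a tokenized review into its count vector; the vocabulary format (index to word) follows the description above, but the helper review_to_counts is not part of the supplied code.

import numpy as np

def review_to_counts(review_words, vocab):
    """Build the count vector x for one review.

    review_words -- list of the words appearing in the review
    vocab        -- dict mapping index j to word w_j
    x[j] is the number of times vocab[j] appears in the review; words outside
    the vocabulary are ignored.
    """
    word_to_index = {w: j for j, w in vocab.items()}
    x = np.zeros(len(vocab), dtype=int)
    for word in review_words:
        j = word_to_index.get(word)
        if j is not None:
            x[j] += 1
    return x

# Toy example with a 4-word vocabulary:
vocab = {0: "good", 1: "bad", 2: "movie", 3: "plot"}
print(review_to_counts(["good", "movie", "good", "ending"], vocab))  # [2 0 1 0]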
(a) (10 points) Implement the AdaBoost algorithm in the run_adaboost function. The class of weak learners we will use is the class of hypotheses of the form
$$h_{j,\theta,b}(x) = \begin{cases} b & x_j \le \theta \\ -b & x_j > \theta, \end{cases} \qquad b \in \{-1,1\}.$$
That is, comparing a single word count to a threshold. At each iteration, AdaBoost will select the best weak learner. Note that the labels are $\{-1,1\}$. Run AdaBoost for T = 80 iterations. Show plots of the training error and the test error of the classifier implied at each iteration t, $\mathrm{sign}\!\left(\sum_{j=1}^{t} \alpha_j h_j(x)\right)$.
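For the error plots, one way to track the classifier implied at each iteration is sketched below; alphas and hyp_preds are hypothetical containers for the chosen weights and the weak hypotheses' predictions, not names from the skeleton file.

import numpy as np

def aggregated_errors(alphas, hyp_preds, y):
    """Error of sign(sum_{j<=t} alpha_j h_j(x)) for every t = 1, ..., T.

    alphas    -- sequence of hypothesis weights alpha_1, ..., alpha_T
    hyp_preds -- array of shape (T, m); row j holds h_j's predictions in {-1, +1}
    y         -- true labels in {-1, +1}, length m
    """
    score = np.zeros(len(y))
    errors = []
    for alpha, preds in zip(alphas, hyp_preds):
        score += alpha * preds
        errors.append(np.mean(np.sign(score) != y))
    return errors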
(b) (10 points) Run AdaBoost for T = 10 iterations. Which weak classifiers did the algorithm choose? Pick 3 that you would expect to help classify reviews and 3 that you would not expect to help, and explain possible reasons for the algorithm choosing them.
(c) (10 points) In the next recitation you will see that AdaBoost minimizes the average exponential loss:
$$\ell = \frac{1}{m}\sum_{i=1}^{m} \exp\!\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right).$$
Run AdaBoost for T = 80 iterations. Show plots of $\ell$ as a function of T, for the training and the test sets. Explain the behavior of the loss.
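The loss curves in (c) can be computed in the same incremental fashion as the error curves in (a); as before, alphas and hyp_preds are hypothetical names for the chosen weights and the weak hypotheses' predictions.

import numpy as np

def exp_losses(alphas, hyp_preds, y):
    """Average exponential loss of the aggregated classifier after each iteration."""
    score = np.zeros(len(y))
    losses = []
    for alpha, preds in zip(alphas, hyp_preds):
        score += alpha * preds
        losses.append(np.mean(np.exp(-y * score)))
    return losses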
[1] http://www.cs.cornell.edu/people/pabo/movie-review-data/