1. AdaBoost (25 points). In this exercise, we will implement AdaBoost and see how boosting can be applied to real-world problems. We will focus on binary sentiment analysis - the task of classifying the polarity of a given text into two classes, positive or negative. We will use movie reviews from IMDB as our data.
Download the provided files from Moodle and put them in the same directory:
• review_polarity.tar.gz - a sentiment analysis dataset of movie reviews from IMDB.[1] Extract its contents into the same directory (with zip, 7z, WinRAR, etc.), so that you have a folder called review_polarity.
• process_data.py - code for loading and preprocessing the data.
• skeleton_adaboost.py - this is the file you will work on; rename it to adaboost.py before submitting.
The main function in adaboost.py calls the parse_data method, which processes the data and represents every review as a 5000-dimensional vector $x$. The entries of $x$ are the counts, in the review that $x$ represents, of the most common words in the dataset (excluding stopwords like "a" and "and"). Concretely, let $w_1, \dots, w_{5000}$ be the most common words in the data; given a review $r_i$, we represent it as a vector $x_i \in \mathbb{N}^{5000}$, where $x_{i,j}$ is the number of times the word $w_j$ appears in $r_i$. The method parse_data returns training data, test data, and a vocabulary. The vocabulary is a dictionary that maps each index in the data to the word it represents (i.e., it maps $j \mapsto w_j$).
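For intuition, here is a minimal sketch of this bag-of-words encoding. The actual preprocessing lives in process_data.py; the tokenization, the helper name review_to_vector, and the variable vocab are assumptions for illustration only.

from collections import Counter

# Hypothetical illustration of the representation parse_data produces:
# map a tokenized review to its 5000-dimensional count vector, given the
# vocabulary dict {j -> w_j}.
def review_to_vector(tokens, vocab):
    counts = Counter(tokens)                          # word -> occurrence count
    return [counts[vocab[j]] for j in range(len(vocab))]

# e.g. review_to_vector(["great", "movie", "great"], vocab) yields a vector
# with 2 at the index of "great", 1 at the index of "movie", and 0 elsewhere.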
(a) Implement the AdaBoost algorithm in the run_adaboost function. The class of weak learners we will use is the class of hypotheses of the form
$$h_{j,\theta,b}(x) = \begin{cases} b & \text{if } x_j \le \theta, \\ -b & \text{if } x_j > \theta, \end{cases} \qquad j \in \{1,\dots,5000\},\ \theta \in \mathbb{R},\ b \in \{-1, 1\},$$
that is, hypotheses that compare a single word count to a threshold. At each iteration, AdaBoost selects the best weak learner. Note that the labels are in $\{-1, 1\}$. Run AdaBoost for $T = 80$ iterations. Show plots of the training error and the test error of the classifier implied at each iteration $t$, $\operatorname{sign}\left(\sum_{j=1}^{t} \alpha_j h_j(x)\right)$.
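A minimal sketch of the loop, assuming numpy arrays X of shape (m, 5000) and labels y in {-1, 1}; the helper name weighted_stump and the (j, theta, b) stump representation are assumptions, not the skeleton's API. The exhaustive threshold search is deliberately simple; a real implementation would sort each feature once and use cumulative weight sums.

import numpy as np

def weighted_stump(X, y, D):
    # Exhaustive search for the stump (j, theta, b) minimizing the
    # weighted 0-1 error under the distribution D.
    m, d = X.shape
    best = (np.inf, 0, 0.0, 1)                    # (error, j, theta, b)
    for j in range(d):
        vals = np.unique(X[:, j])                 # candidate split points
        thetas = np.concatenate(([vals[0] - 1], (vals[:-1] + vals[1:]) / 2))
        for theta in thetas:
            pred = np.where(X[:, j] <= theta, 1, -1)
            err = D[pred != y].sum()              # weighted error for b = +1
            for b, e in ((1, err), (-1, 1 - err)):  # b = -1 flips the prediction
                if e < best[0]:
                    best = (e, j, theta, b)
    return best

def run_adaboost(X, y, T):
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                       # start from the uniform distribution
    hypotheses, alphas = [], []
    for _ in range(T):
        eps, j, theta, b = weighted_stump(X, y, D)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)      # guard the log against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = np.where(X[:, j] <= theta, b, -b)
        D *= np.exp(-alpha * y * pred)            # up-weight the misclassified examples
        D /= D.sum()                              # renormalize to a distribution
        hypotheses.append((j, theta, b))
        alphas.append(alpha)
    return hypotheses, alphas

For the requested plots, accumulate $\alpha_t h_t(x)$ after each round on both the training and test sets and take the sign of the running sum.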
(b) Run AdaBoost for $T = 10$ iterations. Which weak classifiers did the algorithm choose? Pick 3 that you would expect to help classify reviews and 3 that you would not expect to help, and explain possible reasons for the algorithm to choose them.
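One way to inspect the chosen stumps, assuming run_adaboost returns (j, theta, b) triples alongside the alphas as in the sketch above, and that X_train, y_train, and vocabulary follow the assumed parse_data output:

# Map each selected coordinate j back to the word whose count it thresholds.
hypotheses, alphas = run_adaboost(X_train, y_train, T=10)
for (j, theta, b), alpha in zip(hypotheses, alphas):
    print(f"word={vocabulary[j]!r}  theta={theta:.1f}  b={b:+d}  alpha={alpha:.3f}")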
(c) In the next recitation you will see that AdaBoost minimizes the average exponential loss
$$\ell = \frac{1}{m} \sum_{i=1}^{m} \exp\!\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right).$$
Run AdaBoost for $T = 80$ iterations. Show plots of $\ell$ as a function of $T$, for the training and the test sets. Explain the behavior of the loss.
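A sketch of computing $\ell$ after each round, under the same assumed stump representation as above; the margin sum is accumulated incrementally, so the whole curve over $t = 1, \dots, T$ costs a single pass:

import numpy as np

def exp_loss_curve(X, y, hypotheses, alphas):
    # Average exponential loss after each boosting round.
    margins = np.zeros(X.shape[0])                   # running sum of alpha_t * h_t(x_i)
    losses = []
    for (j, theta, b), alpha in zip(hypotheses, alphas):
        margins += alpha * np.where(X[:, j] <= theta, b, -b)
        losses.append(np.mean(np.exp(-y * margins)))  # (1/m) * sum_i exp(-y_i * margin_i)
    return losses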
[1] http://www.cs.cornell.edu/people/pabo/movie-review-data/