Question 1 – Boosting
We learned about boosting in lecture and the topic is covered in Murphy 16.4. On page 555 Murphy claims that “it was proved that one could boost the performance (on the training set) of any weak learner arbitrarily high, provided the weak learned could always perform slightly better than chance.” We will now verify this in the AdaBoost framework.
1. (7 points) Given a set of N observations (xj,yj) where yj is the label yj ∈ {−1,1}, let ht(x) be the weak classifier at step t and let αt be its weight. First we note that the final classifier after T steps is defined as:
( T )
H(x) = sgn Xαtht(x) = sgn{f(x)}
t=1
Where
T
f(x) = Xαtht(x)
t=1 Show that:
Training
Where δ(H(xj) 6= yj) is 1 if H(xj) 6= yj and 0 otherwise.
2. The weight for each data point j at step t + 1 can be defined recursively by:
Where Zt is a normalizing constant ensuring the weights sum to 1:
N
Zt = Xwjt exp(−αtyjht(xj))
j=1
Show that:
3. We showed above that training error is bounded above by . At step t the values Z1,Z2,...,Zt−1 are already fixed therefore at step t we can choose αt to minimize Zt. Let
be the weighted training error for weak classifier ht(x) then we can re-write the formula for Zt as:
(a) First find the value of αt that minimizes Zt then show that
(b) Assume we choose Zt this way. Then re-write where γt 0 implies better than random and γt < 0 implies worse than random. Then show that:
Zt ≤ exp(−2γt2) You may want to use the fact that log(1 − x) ≤ −x for 0 ≤ x < 1
Thus we have:
T T
training ≤ YZt ≤ exp(−2Xγt2)
t=1 t=1
(c) Finally, show that if each classifier is better than random (e.g. γt ≥ γ for all t and γ 0) that:
training ≤ exp(−2Tγ2)
Which shows that the training error can be made arbitrarily small with enough steps.
2 Programming Question (clustering with K-means) [30 points]
In class we discussed the K-means clustering algorithm. Your programming assignment this week is to implement the K-means algorithm on digit data.
2.1 The Data
There are two files with the data. The first digit.txt contains the 1000 observations of 157 pixels (a subset of the original 785) from images containing handwritten digits. The second file labels.txt contains the true digit label (either 1, 3, 5, or 7). You can read both data files in with
X = load(‘../hw5data/digit/digit.txt’);
Y = load(‘../hw5data/digit/labels.txt’);
Please note that there aren’t IDs for the digits. Please assume the first line is ID 1, the second line is ID 2, and so on. The labels correspond to the digit file, so the first line of labels.txt is the label for the digit in the first line of digit.txt.
2.2 The algorithm
Your algorithm should be implemented as follows:
1. Select k starting centers that are points from your data set. You should be able to select these centers randomly or have them given as a parameter.
2. Assign each data point to the cluster associated with the nearest of the k center points.
3. Re-calculate the centers as the mean vector of each cluster from (2).
4. Repeat steps (2) and (3) until convergence or iteration limit.
Define convergence as no change in label assignment from one step to another or you have iterated 20 times (whichever comes first). Please count your iterations as follows: after 20 iterations, you should have assigned the points 20 times.
2.3 Within group sum of squares
The goal of clustering can be thought of as minimizing the variation within groups and consequently maximizing the variation between groups. A good model has low sum of squares within each group.
We define sum of squares in the traditional way. Let Ck be the kth cluster and let µk be the mean of the observations in cluster Ck. Then the within group sum of squares for cluster Ck is defined as:
SS(k) = X ||xi − µk||2
i∈Ck
Please note that the term ||xi − µk||2 is the euclidean distance between xi and µk.
If there are K clusters total then the “total within group sum of squares” is just the sum of all K of these individual SS(k) terms.
2.4 Pair-counting measures
Given that we know the actual assignment labels for each data point we can attempt to analyze how well the clustering recovered this. WE will performance criteria which are based on two principles: i) two data points that belong to the same class should be assigned to the same cluster; and ii) two data points that belong to different classes should be assigned to different clusters. Formally speaking, consider all pairs of same-class data points, let p1 be the percentage of pairs of which both data points were assigned to the same cluster. Consider all pairs of different-class data points, let p2 be the percentage of pairs of which two data points are assigned to different clusters. Let p3 be the average of these two values p3 = (p1+p2)/2, which summarizes the clustering performance.
2.5 Questions
When you have implemented the algorithm please report the following:
1. The values of the total within group sum of squares and pair-counting measures (p1,p2,p3) for k = 2, k = 4 and k = 6. Start your centers with the first k points in the dataset. So, if k = 2, your initial centroids will be ID 1 and ID 2, which correspond to the first two lines in the file.
2. The number of iterations that k-means ran for k = 6, starting the centers as in the previous item. Make sure you count the iterations correctly. If you start with iteration i = 1 and at i = 4 the cluster assignments don’t change, the number of iterations was 4, as you had to do step 2 four times to figure this out.
3.A plot of the total within group sum of squares versus k for k = 1,2,3,4,5,6,7,8,9,10. Start your centers randomly (choose k points from the dataset at random).
4. [8pts] A plot of p1,p2,p3 versus k for k = 1,2,3,4,5,6,7,8,9,10. Start your centers randomly (choose k points from the dataset at random).
For the last two items, you should run k-means algorithm several times (e.g., 10 times) and average the results. For each question, submit a single plot, which is the average of the runs.
3 Programming Question (scene classification) [40 points + 10 bonus]
This question gives you the opportunities to learn LibSVM, which is one of the best software tool for classification problem. You will train SVMs with different kernels and use them to classify scenes from The Big Bang Theory, your favorite TV series. To classify an image, you will use Bag-of-Word representation, which is one of the most popular image representation. This representation views an images as the histogram of image features, or visual words. You MUST use LibSVM and your K-means implementation from Question 3.
3.1 LibSVM installation
First download LibSVM https://www.csie.ntu.edu.tw/˜cjlin/libsvm/. Follow the instruction in README to install LibSVM for Matlab or Python. Two main functions of LibSVM that you should pay attention to are: svmtrain and svmpredict. Note that Matlab also has a machine learning toolbox that comes with these two functions with exactly the same names. However, Matlab’s SVM implementation is not as good as LibSVM, so you need to make sure that you are using svmtrain and svmpredict from LibSVM. To check if you have installed the program correctly, in Matlab do:
which svmtrain
which svmpredict
Matlab should return the paths to the svmtrain and svmpredict of LibSVM. To learn how to use these functions, type the names of the function in Matlab:
svmtrain
svmpredict
3.2 Data
Training and test images are provided in the subdirectory bigbangtheory. The training image ids and labels are given in train.mat. This file contains two variables: imgIds and lbs. imgIds is a column vector and each row has a name of image in the training set. lbs is a matrix denoting the label for the image with the corresponding index. There are total 8 classes for the dataset: living room (1), kitchen (2), hallway (3), Penny’s living room (4), cafeteria (5), Cheesecake factory (6), laundry room (7), and comic bookstore (8).
Validation set is not provided for this question. You have to do cross validation to find the parameter for the best performance. You can implement cross validation by yourself, or you can use LibSVM functionality. Image ids for test set are given in test.mat.
3.3 Helper functions
A number of Matlab helper functions are provided. Also, some functions are left unfinished and you have to complete them.
1. Use HW5BoW.learnDictionary() to learn visual vocabulary for scene representation. You have to fill your K-means implementation from Question 3 in this function.
2. Use HW5 BoW.cmpFeatVecs() to compute feature vectors. This function will return the histogram of visual words for each image, which is our feature vector.
3.4 What to implement?
1. Complete the function HW5BoW.learnDictionary(). This function learns a visual vocabulary by running k-means on random image patches. Learn a visual dictionary with K = 1000 clusters.
2. (10 points) Using SVM with RBF kernel, report the 5-fold cross-validation accuracy for the default kernel parameters for C and γ.
3. (10 points) Tune C and γ for SVM with RBF kernel using 5-fold cross validation. Report the values of C, γ, and the 5-fold cross-validation accuracy.
4. Unfortunately, LibSVM does not support exponential X2 kernel, so you will need to implement it.
Implement a function:
[trainK, testK] = cmpExpX2Kernel(trainD, testD, gamma)
that takes train and test data and the kernel parameter gamma and return the train and test kernels. Recall that the exponential X2 kernel value for two d-dimensional feature vector x and y is:
!
(1)
The term is added to avoid division by 0.
5. LibSVM allows using pre-computed kernels. Train an SVM with exponential X2 kernel. Report the 5-fold cross validation accuracy with the best tuned values of C and γ. Note that a good default value of γ is the average of X2 distance between training data points, as discussed in the lecture.
6. Use your model to predict on the test set. Save the predicted labels on a CSV file predTestLabels.csv (see Section 4.3 for the format). The order of predicted labels should correspond to the order of the reviews in test.mat. Submit this predTestLabels.csv file on Kaggle and report classification accuracy in your answers.pdf file. You must use Stony Brook NetID email to register for Kaggle.
Note: Marks for this question will be scaled according to the ranking on the Private Leaderboard.
7. We will maintain a leader board on Kaggle, and the top three entries at the end of the competition (due date) will receive 10 bonus points. You must use your real names and Stony Brook Net ID on Kaggle to compete for the bonus points.
To prevent exploiting test data, you are allowed to make a maximum of 3 submissions per 24 hours. Your submission will be evaluated immediately and the leader board will be updated.