FMAN45 Assignment 2

1    Solving a nonlinear kernel SVM with hard constraints

In the lecture we have seen the linear Support Vector Machine (SVM) for binary classification, which aims to find parameters w ∈ R^d and a bias term b ∈ R such that w^T x_i + b ≥ 1 if the example x_i ∈ R^d belongs to the positive class (y_i = +1), and w^T x_i + b ≤ −1 when x_i belongs to the negative class (y_i = −1). The size of the margin is inversely proportional to ||w||. The margin should be as large as possible, which corresponds to minimizing ||w||, or rather (1/2)||w||^2 for computational reasons. In this section and the next, we will consider the theoretical aspects of SVM classifiers. First, we examine the linear hard margin SVM, which is defined as the solution to the minimization problem:

\begin{align}
\underset{\mathbf{w},\,b}{\text{minimize}}\quad & \frac{1}{2}\|\mathbf{w}\|^2 \tag{1}\\
\text{subject to}\quad & y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1, \quad \forall i \notag
\end{align}

Consider the one-dimensional binary classification problem with classes ‘+1’ and ‘-1’:

                                                                                                                                        (2)

As can be seen, the dataset is not linearly separable. Instead of trying to solve the above SVM problem, we therefore consider the (non-linear) feature map

                                                                                                                                                             (3)

and the corresponding kernel is given by k(x,y) = φ(x)^T φ(y) and the kernel matrix by

K = [k(x_i, x_j)]_{1≤i,j≤4}

Task T1: (5 p) Compute the kernel matrix K using the data in (2).
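For illustration, a minimal MATLAB sketch of how such a kernel matrix can be assembled. The dataset vector x and the feature map phi are placeholders that must be filled in from (2) and (3); the function name kernel_matrix is introduced here and is not part of the assignment files.

% Sketch (illustration only): kernel matrix K = [k(x_i, x_j)] for a
% one-dimensional dataset x (1 x N) and a feature map phi given as a
% function handle implementing (3).
function K = kernel_matrix(x, phi)
    N = numel(x);
    K = zeros(N, N);
    for i = 1:N
        for j = 1:N
            K(i, j) = phi(x(i))' * phi(x(j));   % k(x_i, x_j) = phi(x_i)^T phi(x_j)
        end
    end
end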

The Lagrangian dual problem for the (hard margin) SVM is given by:

\begin{align}
\underset{\boldsymbol{\alpha}}{\text{maximize}}\quad
  & \sum_{i=1}^{4}\alpha_i - \frac{1}{2}\sum_{i=1}^{4}\sum_{j=1}^{4}\alpha_i\alpha_j y_i y_j\,\varphi(x_i)^\top\varphi(x_j) \tag{4}\\
  & = \sum_{i=1}^{4}\alpha_i - \frac{1}{2}\sum_{i=1}^{4}\sum_{j=1}^{4}\alpha_i\alpha_j y_i y_j\, k(x_i, x_j) \tag{5}\\
\text{subject to}\quad & \alpha_i \ge 0, \;\;\forall i, \quad\text{and}\quad \sum_{i=1}^{4} y_i\alpha_i = 0 \notag
\end{align}

Task T2: (5 p) Solve the maximization problem in (5) for α numerically, using the data in (2). You may use, without proof, that for the data in (2) the solution satisfies α1 = α2 = α3 = α4.
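One possible numerical approach is a minimal sketch with quadprog (Optimization Toolbox): maximizing the dual objective is equivalent to minimizing its negative. The variables K (the 4×4 kernel matrix from T1) and y (a 4×1 label vector) are assumed to be in the workspace.

% Sketch: solve the dual (5) numerically. Maximizing sum(alpha) - 0.5*alpha'*Q*alpha
% with Q = (y*y').*K is equivalent to minimizing 0.5*alpha'*Q*alpha - sum(alpha).
Q   = (y * y') .* K;              % Q_ij = y_i y_j k(x_i, x_j)
f   = -ones(4, 1);                % linear term: -sum(alpha)
Aeq = y';  beq = 0;               % equality constraint: sum_i y_i alpha_i = 0
lb  = zeros(4, 1);  ub = [];      % alpha_i >= 0, no upper bound (hard margin)
alpha = quadprog(Q, f, [], [], Aeq, beq, lb, ub);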

Next, we know that, for any support vector x_s, we have

\begin{align}
b = y_s - \sum_{i=1}^{4}\alpha_i y_i\, k(x_i, x_s) \tag{6}
\end{align}

and that the equation for the classifier is given by

\begin{align}
g(x) = \sum_{i=1}^{4}\alpha_i y_i\, k(x_i, x) + b \tag{7}
\end{align}

Task T3: (10 p) For the data-target pairs in (2), reduce the classifier function (7) to the simplest possible form, leading to a simple polynomial in x.

Now consider another binary classification problem with classes ‘+1’ and ‘-1’, and data:

                                                                                                        (8)

Task T4: (5 p) With the same kernel k(x,y) as above, what is the solution g(x) of the nonlinear kernel SVM with hard constraints on the dataset in (8)? Explain how you arrived at this solution.

2           The Lagrangian dual of the soft margin SVM
In the soft-margin SVM, we allow errors ξ_i in the classification task for each data point x_i. We denote the collection of all ξ_i by ξ. The primal formulation of the linear soft margin classifier is given by

\begin{align}
\underset{\mathbf{w},\,b,\,\boldsymbol{\xi}}{\text{minimize}}\quad & \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i \tag{9}\\
\text{subject to}\quad & y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \forall i \notag\\
& \xi_i \ge 0, \quad \forall i \notag
\end{align}

Task T5: (10 p) Show that the Lagrangian dual problem for (9) is given by

\begin{align}
\underset{\boldsymbol{\alpha}}{\text{maximize}}\quad
  & \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,\mathbf{x}_i^\top\mathbf{x}_j \tag{10}\\
\text{subject to}\quad & 0 \le \alpha_i \le C, \quad \forall i \notag\\
& \sum_{i=1}^{n}\alpha_i y_i = 0 \notag
\end{align}

Task T6: (5 p) Use complementary slackness (of the KKT conditions) to show that support vectors with y_i(w^T x_i + b) < 1 have coefficient α_i = C.
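As a standard starting point for Tasks T5 and T6 (this intermediate step is not part of the original problem statement): introducing multipliers α_i ≥ 0 for the margin constraints and μ_i ≥ 0 for ξ_i ≥ 0, the Lagrangian of (9) reads

\[
L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\mu})
  = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i
  - \sum_{i=1}^{n}\alpha_i\bigl(y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 + \xi_i\bigr)
  - \sum_{i=1}^{n}\mu_i\,\xi_i,
\]

and setting its derivatives with respect to w, b and ξ to zero eliminates the primal variables and leads to (10).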

3           Dimensionality reduction on MNIST using PCA
In the experimental part of this assignment you shall work with the MNIST image data set of handwritten digits, containing data-target pairs with images X_i ∈ [0,1]^{28×28} and targets t_i ∈ {0,1,...,9}. You will only consider a part of the data set, specifically only images with targets t_i ∈ {0,1}. The dataset MNIST_01.mat contains 12665 training data-target pairs {train_data, train_labels} and 2115 test data-target pairs {test_data, test_labels}, where for both sets the images have been stacked column-wise, i.e., x_i = vec(X_i) ∈ R^{28²}.

The MNIST images are of quite low resolution, but the dimensionality is still quite large, i.e., D = 28² = 784. One way of visualizing a collection of high-dimensional data examples is to use a dimensionality reduction technique. Principal component analysis (PCA) is a method for reducing the dimensionality of a dataset, while retaining as much of the data's variability as possible. PCA utilizes the singular value decomposition (SVD), which, for a matrix of N zero-mean data examples of dimension D, i.e., X ∈ R^{D×N}, is the factorization:

\begin{align}
\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^\top \tag{11}
\end{align}

where, for P = min(D,N), U ∈ R^{D×P}, S = diag(s) with s ∈ R^P, and V ∈ R^{N×P} are the left singular vectors, the diagonal matrix of the P singular values, and the right singular vectors, respectively. The dimensionality reduction of X to dimension d is then the projection of X onto the d left singular vectors with the largest singular values.
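Written out (with notation introduced here, not part of the original text): if U_d ∈ R^{D×d} collects the d leading left singular vectors, the reduced data are

\[
\mathbf{X}_d = \mathbf{U}_d^\top \mathbf{X} \in \mathbb{R}^{d\times N}.
\]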

Task E1: (5 p) Compute a linear PCA and visualize the whole training data in d = 2 dimensions. Make sure that the necessary condition to apply PCA is met. Display your results in a plot where the data examples are marked with different markers and colors for each of the two classes. Make the plot clearly interpretable by using legends, adjusting font size, etc.
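A minimal MATLAB sketch of one possible workflow for E1, assuming the training images and labels are loaded into variables train_data (784 × N) and train_labels (N × 1); adjust the variable names to match the provided .mat file.

% Sketch for Task E1: 2-D PCA visualization of the training data.
X  = double(train_data);
Xc = X - mean(X, 2);                    % remove the mean (required for PCA)
[U, ~, ~] = svd(Xc, 'econ');            % economy-size SVD, Xc = U*S*V'
Z  = U(:, 1:2)' * Xc;                   % projection onto the two leading components

idx0 = (train_labels == 0);
idx1 = (train_labels == 1);
figure; hold on;
plot(Z(1, idx0), Z(2, idx0), 'bo');     % class '0'
plot(Z(1, idx1), Z(2, idx1), 'rx');     % class '1'
legend('digit ''0''', 'digit ''1''');
xlabel('PC 1'); ylabel('PC 2');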

4           Clustering of unsupervised data using K-means
In this section, you shall cluster the MNIST images without using their target values. K-means clustering is an unsupervised approach for computing K clusters by iteratively determining the so-called centroid of each cluster and assigning each sample to a cluster. The centroid of a cluster is defined as the mean location of all samples within the cluster. In the assignment step, each sample gets assigned to the closest centroid. Algorithm 1 summarizes the K-means clustering approach.

Algorithm 1 K-means clustering

Select number of clusters K, convergence threshold ε_tol, and maximum iterations j_max.
Initialize the centroids C^(0) = [c_1^(0), ..., c_K^(0)] randomly.
for j = 1,...,j_max do
    Step 1: Assign examples to clusters:
        d_i = fxdist(x_i, C^(j−1)),  i = 1,...,N
        y_i^(j) = argmin_k [d_i]_k,  i = 1,...,N
    Step 2: Assign new cluster centroids:
        N_k^(j) = |{i : y_i^(j) = k}|,  k = 1,...,K
        c_k^(j) = (1 / N_k^(j)) Σ_{i : y_i^(j) = k} x_i,  k = 1,...,K
    Check convergence:
        if fcdist(C^(j), C^(j−1)) < ε_tol then return y^(j), C^(j)
    end if
end for
return y^(j_max), C^(j_max)

 

Task E2: (20 p) Implement K-means clustering for the training data, using the provided function sketch K_means_clustering.m. Do so by completing the following steps (a minimal sketch of the individual components is given after the important notes below):

1.    Define and implement a distance function fxdist(x,C) which computes a distance of your choosing between a single example and the K centroids.

2.    Define and implement a pairwise distance function, fcdist(C,C˜), which computes a distance of your choosing between two cluster centroids, such that the distance is exactly zero when the pairs of centroids agree.

3.    Construct the first step of the K-means clustering algorithm: define and implement a function step_assign_cluster which assigns each sample to one of the K clusters.

4.    Construct the second step of the K-means clustering algorithm: define and implement a function step_compute_mean which assigns new centroids by computing the means of the newly assigned clusters, outputting both the new centroids and a measure of the distance by which the centroids have moved. The latter is used as a stopping criterion; the algorithm stops when the centroids have changed by less than a specified threshold.

5.    Implement a fully running K_means_clustering using the components you have implemented above, and run the algorithm on the original training data for K = 2 and K = 5 clusters. Reuse the pre-trained linear PCA from Task E1 only to visualize your results in two dimensions, for each K, in a plot where the data examples are marked with a different marker and color for each cluster. Make the plot clearly interpretable by using legends, adjusting font size, etc.

Important:

•    Use the provided function sketch K_means_clustering.m. Do not implement the function from scratch.

•    Only make changes to K_means_clustering.m in the lines between % CHANGE and % DO NOT CHANGE. Do not change other parts of the function.

•    Explain why the clusters seem to overlap for, e.g., K = 5.
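The following is a minimal, self-contained sketch of the four components listed above, assuming Euclidean distances and column-stacked data (X is D × N, centroids C are D × K). The provided K_means_clustering.m template may expect slightly different interfaces, so treat this only as an illustration of the logic, not as the code to paste in verbatim.

% Hedged sketch of the E2 components (illustration only).
function d = fxdist(x, C)
% Euclidean distance between one example x (D x 1) and each of the K centroids.
    d = sqrt(sum((C - x).^2, 1))';                 % K x 1
end

function d = fcdist(C, C_new)
% Total Euclidean movement between two sets of centroids; zero iff they agree.
    d = sum(sqrt(sum((C - C_new).^2, 1)));
end

function y = step_assign_cluster(X, C)
% Step 1: assign every example to its closest centroid.
    N = size(X, 2);
    y = zeros(N, 1);
    for i = 1:N
        [~, y(i)] = min(fxdist(X(:, i), C));
    end
end

function [C_new, moved] = step_compute_mean(X, y, C)
% Step 2: recompute each centroid as the mean of its assigned examples and
% report how far the centroids moved (used as the stopping criterion).
    K = size(C, 2);
    C_new = C;
    for k = 1:K
        if any(y == k)
            C_new(:, k) = mean(X(:, y == k), 2);
        end
    end
    moved = fcdist(C, C_new);
end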

We could now check only a few samples from each cluster to assign a suitable label to the centroid. Having assigned a label to a centroid, we assign the same label to all samples in the corresponding cluster. Alternatively, we can plot the centroid image and assign a suitable class manually. These approaches are advantageous when labeling all samples individually is expensive. Here, we are in the privileged situation of knowing the target labels for all samples, so we may also assign to a centroid the label that is most frequent in its cluster. Assigning the same label to all samples in the cluster will lead to some misclassifications. By summing the number of misclassifications in each cluster, a misclassification rate may be calculated for the dataset.

Task E3: (5 p) Display the K = 2 and K = 5 centroids as images. Make a plot with 1 × K subplots illustrating the centroid of each cluster as an image, using the Matlab function imshow(). Use legends, titles, or otherwise mark which cluster is displayed. Note that you have to stack the centroids back into the shape X_i ∈ R^{28×28} to display the image properly, using, e.g., reshape().
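A minimal sketch for E3, assuming the centroid matrix C (784 × K) is the output of the clustering from Task E2:

% Sketch for Task E3: display each centroid as a 28x28 image.
K = size(C, 2);
figure;
for k = 1:K
    subplot(1, K, k);
    imshow(reshape(C(:, k), 28, 28));   % stack the centroid back into an image
    title(sprintf('Cluster %d', k));
end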

Task E4: (5 p) Now you shall use K-means clustering for classification, by completing the following steps:

1.    For K = 2, implement a function K_means_classifier which assigns to a given example the label of the closest centroid. You may reuse the distance function from Task E2. Assign to each cluster centroid the label that is most frequent among its training examples.

2.    Evaluate how many misclassifications occur for the train and test sets train_data and test_data by comparing the computed cluster labels to the true labels in train_labels and test_labels. Then produce and fill in the following table (which is provided in the supplementary material); a minimal code sketch follows the table:

Table 1: K-means classification results

Training data
  Cluster    # '0'    # '1'    Assigned to class    # misclassified
  1            ?        ?              ?                   ?
  ...         ...      ...            ...                 ...
  K            ?        ?              ?                   ?
  Ntrain = ?                    Sum misclassified:          ?
                                Misclassification rate (%): ?

Testing data
  Cluster    # '0'    # '1'    Assigned to class    # misclassified
  1            ?        ?              ?                   ?
  ...         ...      ...            ...                 ...
  K            ?        ?              ?                   ?
  Ntest = ?                     Sum misclassified:          ?
                                Misclassification rate (%): ?
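A hedged sketch of one way to carry out both steps of E4 (illustration only; it reuses the fxdist sketch from E2 and assumes the centroids C and training cluster assignments y come from Task E2):

% Sketch for Task E4: nearest-centroid classification.
% Assign each centroid the most frequent training label in its cluster.
K = size(C, 2);
centroid_label = zeros(K, 1);
for k = 1:K
    centroid_label(k) = mode(train_labels(y == k));
end

% Classify the test set (this logic would go into K_means_classifier).
Ntest = size(test_data, 2);
pred  = zeros(Ntest, 1);
for i = 1:Ntest
    [~, k]  = min(fxdist(test_data(:, i), C));
    pred(i) = centroid_label(k);
end
misclass_rate = 100 * mean(pred ~= test_labels(:));   % in percent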
 

Task E5: (5 p) Try out some different K values. Can you lower the misclassification rate further on test data by considering a different number of clusters K?

5           Classification of MNIST digits using SVM
In the previous section, you implemented a classifier using unsupervised data, by virtue of the data structure itself. In this section, you shall consider a supervised classifier, namely the support vector machine (SVM). Your task is to use the soft-margin SVM for binary classification, which solves the optimization problem

\begin{align}
\underset{\mathbf{w},\,b,\,\boldsymbol{\xi}}{\text{minimize}}\quad & \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i \tag{12}\\
\text{subject to}\quad & y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \forall i \notag
\end{align}

where yi = 1 for target ti = 1, and yi = −1 for target ti = 0.

Task E6: (5 p) Use the supervised training data {train_data, train_labels} to train a linear SVM classifier:

1.    One trains the soft-margin SVM by solving the Lagrangian dual problem derived in Task T5 using numerical methods. Train a binary soft-margin SVM using Matlab's built-in function fitcsvm(X,T) on the training data train_data and targets train_labels, where X denotes the N × D matrix of data examples, and T the target vector of length N. Note that the data provided is of size D × N, so you need to transpose it to match the input format. The parameter C in the SVM problem has to be set. The Matlab implementation uses a default value of C = 1.0, which you may leave unchanged.

2.    Calculate class predictions using the built-in Matlab function predict(model,X), where model is the output from fitcsvm(), and X is the N × D dataset matrix to do predictions on. Then evaluate the misclassification rate on both the training and test data by filling in the following table (which is provided in the supplementary material); a minimal code sketch follows the table:

Table 2: Linear SVM classification results

Training data
                         Predicted class
  True class:          # '0'      # '1'
  '0'                     ?          ?
  '1'                     ?          ?
  Ntrain = ?            Sum misclassified:            ?
                        Misclassification rate (%):   ?

Testing data
                         Predicted class
  True class:          # '0'      # '1'
  '0'                     ?          ?
  '1'                     ?          ?
  Ntest = ?             Sum misclassified:            ?
                        Misclassification rate (%):   ?
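A minimal sketch of the pipeline described above, assuming train_data/test_data are stored as 784 × N matrices (hence the transposes) and train_labels/test_labels are 0/1 vectors:

% Sketch for Task E6: linear soft-margin SVM with Matlab's fitcsvm.
model      = fitcsvm(train_data', train_labels);      % default BoxConstraint C = 1
pred_train = predict(model, train_data');
pred_test  = predict(model, test_data');

n_err_train = sum(pred_train ~= train_labels(:));
n_err_test  = sum(pred_test  ~= test_labels(:));
rate_test   = 100 * n_err_test / numel(test_labels);   % misclassification rate (%)

% Example confusion count for the table: true '0' predicted as '1'.
n_01 = sum(train_labels(:) == 0 & pred_train == 1);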
 

Next, you shall look at SVMs with kernels, more specifically a Gaussian kernel. The kernel SVM solves the SVM optimization problem not on the data directly, but in a feature space z = φ(x), such that the classifier is linear in z. In practice, this is done using the so-called kernel trick: the data is never mapped into this feature space explicitly; instead, the dual problem is modified appropriately.

Task E7: (10 p) Use the supervised data again to train a non-linear kernel SVM classifier, using a Gaussian kernel:

1.    Train an SVM with a Gaussian kernel using Matlab's built-in function, by specifying fitcsvm(X,T,'KernelFunction','gaussian').

2.    The Gaussian kernel has a scaling parameter σ², which in Matlab's SVM estimator is set to β = √(1/σ²) = 1 by default, but it may be modified by specifying fitcsvm(X,T,'KernelFunction','gaussian','KernelScale',beta) for some value beta. Can you lower the misclassification rate on the test data by tuning beta?

3.    Present your best results, similar to Task E5, by filling in the following table (which is provided in the supplementary material); a minimal code sketch follows the table:

Table 3: Gaussian kernel SVM classification results

Training data
                         Predicted class
  True class:          # '0'      # '1'
  '0'                     ?          ?
  '1'                     ?          ?
  Ntrain = ?            Sum misclassified:            ?
                        Misclassification rate (%):   ?

Testing data
                         Predicted class
  True class:          # '0'      # '1'
  '0'                     ?          ?
  '1'                     ?          ?
  Ntest = ?             Sum misclassified:            ?
                        Misclassification rate (%):   ?
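A minimal sketch for E7, sweeping a few candidate values of beta and keeping the one with the lowest test error; the candidate grid below is an arbitrary illustration, not a recommended setting.

% Sketch for Task E7: Gaussian-kernel SVM, tuning the KernelScale parameter.
betas    = [1 2 5 10 20];             % placeholder candidate values
test_err = zeros(size(betas));
for m = 1:numel(betas)
    model = fitcsvm(train_data', train_labels, ...
                    'KernelFunction', 'gaussian', 'KernelScale', betas(m));
    pred        = predict(model, test_data');
    test_err(m) = mean(pred ~= test_labels(:));
end
[~, best] = min(test_err);
fprintf('Best beta = %g, test misclassification rate = %.2f %%\n', ...
        betas(best), 100 * test_err(best));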
 

Task E8: (5 p) We can achieve a very low misclassification rate on both training and test data with a good choice of parameters for the Gaussian kernel SVM. Can we therefore expect the same error on new images? Explain why such a conclusion would be false.

A           Supplementary material
Contents of data archive A2_data.mat used in sections 3-5:

train_data_01     MNIST images of '0' and '1' for training, X ∈ R^{784×12665}
train_labels_01   MNIST labels '0' and '1' for training, t ∈ R^{12665}
test_data_01      MNIST images of '0' and '1' for testing, X ∈ R^{784×2115}
test_labels_01    MNIST labels '0' and '1' for testing, t ∈ R^{2115}

Provided files:

•    K_means_clustering.m

Matlab function sketch [y,C] = K_means_clustering(X,K), which returns cluster assignments y and cluster centroids C given input data X and the number of clusters K.

•    A2_tables.tex:

LaTeX-formatted tables from Sections 3 and 4, to be filled in and used in the report.
