CPTS570 Homework 2

1         Analytical Part
This part will be graded as a PASS or FAIL.

1.    Suppose we have n+ positive training examples and n− negative training examples. Let C+ be the center of the positive examples and C− be the center of the negative examples, i.e., C+ = (1/n+) Σ_{i:yi=+1} xi and C− = (1/n−) Σ_{i:yi=−1} xi. Consider a simple classifier called CLOSE that classifies a test example x by assigning it to the class whose center is closest.

•    Show that the decision boundary of the CLOSE classifier is a linear hyperplane, i.e., that the classifier can be written as sign(w · x + b). Compute the values of w and b in terms of C+ and C−.

•    Recall that the weight vector can be written as a linear combination of all the training examples: w = Σi αi yi xi. Compute the dual weights (αi's). How many of the training examples are support vectors?
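
For orientation, here is a hedged sketch of both parts (the assignment asks for the full argument; this only fixes notation and the expected shape of the answer):

```latex
% Sketch: CLOSE predicts +1 iff x is closer to C_+ than to C_-.
% Expanding and cancelling \|x\|^2 from both sides:
\[
\|x - C_+\|^2 < \|x - C_-\|^2
\iff 2\,(C_+ - C_-)\cdot x + \left(\|C_-\|^2 - \|C_+\|^2\right) > 0,
\]
% so w = 2(C_+ - C_-) and b = \|C_-\|^2 - \|C_+\|^2. Matching
% w = \sum_i \alpha_i y_i x_i against w = 2C_+ - 2C_- gives
% \alpha_i = 2/n_+ for positives and \alpha_i = 2/n_- for negatives;
% every \alpha_i is nonzero, so all n_+ + n_- examples are support vectors.
```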

2.    Suppose we use the following radial basis function (RBF) kernel: K(xi,xj) = exp(−‖xi − xj‖²/(2σ²)), which has some implicit unknown mapping φ(x).

•    Prove that the mapping φ(x) corresponding to the RBF kernel has infinite dimensions.

•    Prove that for any two input examples xi and xj, the squared Euclidean distance between their corresponding points in the higher-dimensional space defined by φ is at most 2, i.e., ‖φ(xi) − φ(xj)‖² ≤ 2.
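
The key identity, as a sketch (it uses K(x, x) = 1 and K(xi, xj) > 0, which hold for the RBF kernel above):

```latex
% Expand the squared norm via the kernel trick:
\[
\|\phi(x_i) - \phi(x_j)\|^2
  = K(x_i, x_i) - 2\,K(x_i, x_j) + K(x_j, x_j)
  = 2 - 2\,K(x_i, x_j) \le 2,
\]
% since K(x_i, x_i) = e^{0} = 1 and K(x_i, x_j) = e^{-\|x_i - x_j\|^2/(2\sigma^2)} > 0.
```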

3.    The decision boundary of an SVM with a kernel function (via implicit feature mapping φ(·)) is defined as follows:

w · φ(x) + b = Σ_{i∈SV} yi αi K(xi, x) + b = f(x; α, b),

where w and b are parameters of the decision boundary in the feature space φ defined by the kernel function K, SV is the set of support vectors, and αi is the dual weight of the i-th support vector.

Let us assume that we use the radial basis function (RBF) kernel K(xi,xj) = exp(−‖xi − xj‖²/(2σ²)); also assume that the training examples are linearly separable in the feature space φ and that the SVM finds a decision boundary that perfectly separates the training examples.

If we choose a testing example xfar that is far away from every training instance xi (distance here is measured in the original feature space ℝ^d), prove that f(xfar; α, b) ≈ b.
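
The intuition, under the RBF form above (a sketch, not the full proof):

```latex
% As x_far moves away from every x_i, each kernel evaluation vanishes:
\[
K(x_i, x_{\mathrm{far}})
  = e^{-\|x_i - x_{\mathrm{far}}\|^2/(2\sigma^2)}
  \xrightarrow{\;\|x_i - x_{\mathrm{far}}\| \to \infty\;} 0,
\]
% so the finite sum \sum_{i \in SV} y_i \alpha_i K(x_i, x_far) \approx 0
% and f(x_far; \alpha, b) \approx b.
```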

4.    The function K(xi,xj) = −⟨xi, xj⟩ is a valid kernel. Prove or disprove it.
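
One standard line of attack, as a sketch (a valid kernel must induce positive semi-definite Gram matrices):

```latex
% Take any single nonzero example x; its 1x1 Gram matrix is
\[
\big[\,K(x, x)\,\big] = \big[\,-\langle x, x\rangle\,\big]
  = \big[\,-\|x\|^2\,\big],
\]
% which is negative, hence not positive semi-definite. A valid kernel
% must satisfy K(x, x) = \|\phi(x)\|^2 \ge 0, so this K is not valid.
```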

5.    You are provided with n training examples: (x1,y1),(x2,y2),···,(xn,yn), where xi is the input example and yi is the class label (+1 or −1). The teacher gave you some additional information by specifying the costs of different mistakes, C+ and C−, which stand for the cost of misclassifying a positive and a negative example, respectively.

a. How would you modify the soft-margin SVM formulation to leverage this extra information? Justify your answer.
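
One common instantiation, shown only as a hedged sketch (the question asks you to justify whatever you propose): scale each slack penalty by the cost of the corresponding mistake.

```latex
\[
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\|w\|^2
  \;+\; C_+ \sum_{i:\,y_i=+1} \xi_i
  \;+\; C_- \sum_{i:\,y_i=-1} \xi_i
\quad \text{s.t.}\quad
  y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0 \;\;\forall i.
\]
```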

6.    You are provided with a set of n training examples: (x1,y1),(x2,y2),···,(xn,yn), where xi is the input example and yi is the class label (+1 or −1). Suppose n is very large (say on the order of millions). In this case, standard SVM training algorithms will not scale due to the large training set.

Tom wants to devise a solution based on the “Coarse-to-Fine” framework of problem solving. The basic idea is to cluster the training data; train an SVM classifier based on the clusters (coarse problem); refine the clusters as needed (fine problem); perform training on the finer problem; and repeat until convergence. Suppose we start with k+ positive clusters and k− negative clusters (a cluster is defined as a set of examples). Please specify the mathematical formulation (define all the variables used in your formulation) and a concrete algorithm for each of the following steps to instantiate this idea (a rough end-to-end sketch follows the sub-questions):

a)   How to define the SVM training formulation for a given level of coarseness: a set of k+ positive clusters and a set of k− negative clusters?

b)  How to refine the clusters based on the resulting SVM classifier?

c)   What is the stopping criterion?

Optional question: For what kind of problems will this solution fail?
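
The rough sketch referenced above, as one plausible instantiation in Python with scikit-learn (the function names, the margin-based refinement rule, and the KMeans sub-clustering are all assumptions, not the unique intended answer):

```python
# Hypothetical coarse-to-fine SVM training: represent each cluster by its
# centroid, train on centroids, then split clusters that touch the margin.
import numpy as np
from sklearn.svm import SVC
from sklearn.cluster import KMeans

def coarse_to_fine_svm(X, y, k_pos=50, k_neg=50, max_rounds=10, C=1.0):
    # Initial coarse clustering, done separately per class (labels are +/-1).
    clusters = []
    for label, k in ((+1, k_pos), (-1, k_neg)):
        idx = np.where(y == label)[0]
        km = KMeans(n_clusters=k, n_init=10).fit(X[idx])
        for c in range(k):
            clusters.append((idx[km.labels_ == c], label))
    for _ in range(max_rounds):
        # Coarse problem: one training point (the centroid) per cluster.
        centroids = np.array([X[m].mean(axis=0) for m, _ in clusters])
        labels = np.array([lbl for _, lbl in clusters])
        svm = SVC(kernel="linear", C=C).fit(centroids, labels)
        # Refinement: split any cluster whose centroid lies inside the
        # margin (|f(centroid)| < 1) into two sub-clusters.
        margin = np.abs(svm.decision_function(centroids))
        new_clusters, changed = [], False
        for (members, lbl), m in zip(clusters, margin):
            if m < 1.0 and len(members) > 1:
                km = KMeans(n_clusters=2, n_init=10).fit(X[members])
                new_clusters += [(members[km.labels_ == 0], lbl),
                                 (members[km.labels_ == 1], lbl)]
                changed = True
            else:
                new_clusters.append((members, lbl))
        clusters = new_clusters
        if not changed:            # stopping criterion: no cluster was refined
            break
    return svm
```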

7.    You are provided with a set of n training examples: (x1,y1),(x2,y2),···,(xn,yn), where xi is the input example and yi is the class label (+1 or −1). Suppose n is very large (say on the order of millions). In this case, online kernelized Perceptron algorithms will not scale if the number of allowed support vectors is unbounded.

a)   Suppose you have trained using the kernelized Perceptron algorithm (without any bound on support vectors) and obtained a set of support vectors SV. Tom wants to use this classifier for real-time prediction and cannot afford more than B kernel evaluations for each classification decision. Please give an algorithm to select B support vectors from SV. You need to motivate your design choices in order to convince Tom to use your solution.

b)  Tom wants to train using the kernelized Perceptron algorithm, but wants to use at most B support vectors during the training process. Please modify the standard kernelized Perceptron training algorithm (from the class slides) for this new setting. You need to motivate your design choices in order to convince Tom to use your solution.
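
For part b), one plausible budgeted variant in the spirit of budget perceptrons, sketched below (the eviction rule is an assumption, one of several defensible choices): on each mistake, add the example to the support set; if the budget B is exceeded, evict the stored example that the remaining predictor still classifies correctly with the largest margin.

```python
# Hypothetical budgeted kernelized Perceptron sketch (inputs are numpy arrays).
import numpy as np

def rbf(a, b, sigma=1.0):
    # RBF kernel, matching the kernel used in the analytical part.
    return np.exp(-np.linalg.norm(a - b) ** 2 / (2 * sigma ** 2))

def predict_score(x, sv_x, sv_y, alpha, kernel):
    return sum(a * t * kernel(s, x) for s, t, a in zip(sv_x, sv_y, alpha))

def budget_kernel_perceptron(X, y, B, epochs=5, kernel=rbf):
    sv_x, sv_y, alpha = [], [], []            # support set, at most B entries
    for _ in range(epochs):
        for x, target in zip(X, y):
            if target * predict_score(x, sv_x, sv_y, alpha, kernel) <= 0:
                sv_x.append(x.copy()); sv_y.append(target); alpha.append(1.0)
                if len(sv_x) > B:
                    # Evict the stored example whose removal hurts least:
                    # the one classified correctly with the largest margin
                    # by the predictor that excludes it.
                    def margin_without(i):
                        full = predict_score(sv_x[i], sv_x, sv_y, alpha, kernel)
                        return sv_y[i] * (full - alpha[i] * sv_y[i]
                                          * kernel(sv_x[i], sv_x[i]))
                    j = max(range(len(sv_x)), key=margin_without)
                    for lst in (sv_x, sv_y, alpha):
                        lst.pop(j)
    return sv_x, sv_y, alpha
```

The same "largest surviving margin" score could also drive the selection of B support vectors after unbounded training in part a).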

2         Programming and Empirical Analysis Part
1.    Empirical analysis question. You can use a publicly available SVM classifier implementation for the SVM-related experiments, e.g., scikit-learn (http://scikit-learn.org/stable/modules/svm.html).

You will use the Fashion MNIST data (https://github.com/zalandoresearch/fashion-mnist). There is a fixed training and testing set. From the training data, use the first 80 percent for “training” and the last 20 percent as “validation data”.

Each example is a 28x28 grayscale image, associated with a label from 10 classes: 0 T-shirt/top, 1 Trouser, 2 Pullover, 3 Dress, 4 Coat, 5 Sandal, 6 Shirt, 7 Sneaker, 8 Bag, 9 Ankle boot.

You will use ONLY the training data for training and the testing data for evaluation.

(a)  Using a linear kernel, train the SVM on the training data for different values of the C parameter: 10^-4, 10^-3, 10^-2, 10^-1, 10^0, 10^1, 10^2, 10^3, 10^4. Compute the training accuracy, validation accuracy, and testing accuracy for the SVM obtained with each value of the C parameter. Plot the training, validation, and testing accuracy as a function of C (C value on the x-axis and accuracy on the y-axis), one curve each for training, validation, and testing data. Also, plot the number of support vectors (if applicable for the SVM toolkit you are using) as a function of C. List your observations. (A combined code sketch for parts (a)-(c) follows part (c).)

(b)  Select the best value of the hyper-parameter C based on the accuracy on the validation set, and train a linear SVM on the combined set of training and validation examples. Compute the testing accuracy and the corresponding confusion matrix: a 10 × 10 matrix.

(c)  Repeat experiment (a) with the best C value from (a), using polynomial kernels of degree 2, 3, and 4. Compare the training, validation, and testing accuracies, and the number of support vectors, for the different kernels (linear, and polynomial of degree 2, 3, and 4). List your observations.
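
The combined sketch referenced in part (a): a hypothetical Python outline using scikit-learn's SVC (kernel="linear") so that support-vector counts are available. Data loading, the variable names X_tr/y_tr etc., and any subsampling of Fashion MNIST (kernel SVMs can be slow on the full set) are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

def evaluate(clf, splits):
    # Accuracy on each split plus the total number of support vectors.
    (X_tr, y_tr), (X_val, y_val), (X_te, y_te) = splits
    return (clf.score(X_tr, y_tr), clf.score(X_val, y_val),
            clf.score(X_te, y_te), int(clf.n_support_.sum()))

def run_experiments(X_tr, y_tr, X_val, y_val, X_te, y_te):
    splits = ((X_tr, y_tr), (X_val, y_val), (X_te, y_te))
    # (a) Sweep C for the linear kernel; plot the four numbers vs. C.
    sweep = {}
    for C in [10.0 ** k for k in range(-4, 5)]:
        sweep[C] = evaluate(SVC(kernel="linear", C=C).fit(X_tr, y_tr), splits)
    # (b) Pick best C by validation accuracy; retrain on train + validation.
    best_C = max(sweep, key=lambda C: sweep[C][1])
    clf = SVC(kernel="linear", C=best_C).fit(
        np.vstack([X_tr, X_val]), np.concatenate([y_tr, y_val]))
    cm = confusion_matrix(y_te, clf.predict(X_te))  # 10x10: rows true, cols predicted
    # (c) Polynomial kernels of degree 2, 3, 4 at the same C.
    poly = {d: evaluate(SVC(kernel="poly", degree=d, C=best_C).fit(X_tr, y_tr),
                        splits)
            for d in (2, 3, 4)}
    return sweep, best_C, cm, poly
```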

2.    Programming question. You will implement the kernelized Perceptron training algorithm (discussed in class) for multi-class classification.

(a)  You will use the Fashion MNIST data. Train the kernelized Perceptron classifier for 5 iterations with a polynomial kernel (pick the best degree out of 2, 3, and 4 from the above experiment). Plot the number of mistakes as a function of training iterations. Compute the training, validation, and testing accuracy at the end of 5 iterations.
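
A minimal sketch of one way to structure the multi-class kernelized Perceptron (the per-class dual-weight layout and the promote/demote update are common choices; the class slides may prescribe a different variant, and the precomputed Gram matrix assumes the data fits in memory):

```python
import numpy as np

def poly_kernel(A, B, degree=3):
    # Polynomial kernel (1 + a.b)^degree between rows of A and rows of B.
    return (A @ B.T + 1.0) ** degree

def train_multiclass_kperceptron(X, y, n_classes=10, degree=3, iters=5):
    n = len(X)
    K = poly_kernel(X, X, degree)           # precomputed n x n Gram matrix
    alpha = np.zeros((n_classes, n))        # dual weights, one row per class
    mistakes_per_iter = []
    for _ in range(iters):
        mistakes = 0
        for i in range(n):
            scores = alpha @ K[:, i]        # score of each class on example i
            y_hat = int(np.argmax(scores))
            if y_hat != y[i]:               # mistake: promote true, demote predicted
                alpha[y[i], i] += 1.0
                alpha[y_hat, i] -= 1.0
                mistakes += 1
        mistakes_per_iter.append(mistakes)
    # Predicting a new x uses alpha @ poly_kernel(X, x[None, :], degree).
    return alpha, mistakes_per_iter
```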

3.    Programming question. “Breast Cancer” classifier using decision trees. You will use the following dataset for this question: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. You will use the first 70 percent of the examples for training, the next 10 percent for validation, and the last 20 percent for testing.

(a)  Implement the ID3 decision tree learning algorithm that we discussed in class. The key step in decision tree learning is choosing the next feature to split on. Implement the information gain heuristic for selecting the next feature. Please see the lecture notes or https://en.wikipedia.org/wiki/ID3_algorithm for more details. Do the following to select candidate thresholds for continuous features: sort all candidate values for feature f from the training data. Suppose f1,f2,···,fn is the sorted list. The candidate thresholds are chosen as fi + (fi+1 − fi)/2 for i = 1 to n−1.
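
A minimal sketch of the threshold-candidate and information-gain machinery for part (a) (function names and the binary ≤-threshold split convention are assumptions):

```python
import numpy as np

def entropy(y):
    # Entropy of a label vector, label-encoding agnostic.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def candidate_thresholds(values):
    # Midpoints between consecutive sorted values, as the question specifies.
    f = np.sort(np.unique(values))
    return (f[:-1] + f[1:]) / 2.0

def best_split(X, y):
    # Return the (feature, threshold) pair maximizing information gain.
    base, best = entropy(y), (None, None, -1.0)
    for j in range(X.shape[1]):
        for t in candidate_thresholds(X[:, j]):
            left = X[:, j] <= t
            gain = base - (left.mean() * entropy(y[left])
                           + (~left).mean() * entropy(y[~left]))
            if gain > best[2]:
                best = (j, t, gain)
    return best[:2]
```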

(b)  Run the decision tree construction algorithm on the training examples. Compute the accuracy on the validation examples and the testing examples.

(c)  Implement the decision tree pruning algorithm discussed in class (using the validation data). (A pruning sketch follows part (d).)

(d)  Run the pruning algorithm on the decision tree constructed from the training examples. Compute the accuracy on the validation examples and the testing examples. List your observations by comparing the performance of the decision tree with and without pruning.
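
For part (c), a sketch of reduced-error pruning, one common reading of “pruning via validation data” (the course may have specified a different variant; the Node layout is an assumption matching a typical ID3 tree):

```python
import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None     # split feature index (None for a leaf)
    threshold: float = 0.0            # split threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    majority: int = 0                 # majority class of training data here

def predict(node, x):
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.majority

def accuracy(root, X, y):
    return np.mean([predict(root, x) == t for x, t in zip(X, y)])

def prune(root, node, X_val, y_val):
    # Bottom-up: prune the children first, then try collapsing this node
    # into a leaf; keep the prune if validation accuracy does not drop.
    if node.feature is None:
        return
    prune(root, node.left, X_val, y_val)
    prune(root, node.right, X_val, y_val)
    before = accuracy(root, X_val, y_val)
    saved = node.feature
    node.feature = None               # temporarily turn node into a leaf
    if accuracy(root, X_val, y_val) >= before:
        node.left = node.right = None # keep the prune
    else:
        node.feature = saved          # revert

# Usage: prune(root, root, X_val, y_val), then re-evaluate on the test set.
```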
