CSE512 - Machine Learning - Homework 3 - Solved

This homework contains 2 questions. The second question requires programming. The maximum number of points is 100.

1          Question 1 – Nearest Neighbor Classifiers (40 points)
1.1      1-NN with asymmetric loss (20 points)

Suppose we want to build a binary classifier, but the cost of a false positive (predicting positive for a negative case) is much higher than the cost of a false negative (predicting negative for a positive case). One can consider an asymmetric loss function in which the cost of a false negative is 1, while the cost of a false positive is α > 1. For a point x, let η(x) be the probability that x is positive.
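To pin down the setup (a standard observation, stated here only for clarity): under this loss, the expected cost of predicting positive at x is α(1 − η(x)), while the expected cost of predicting negative is η(x). In LaTeX form:

\[
R(\hat{y} = +1 \mid x) = \alpha \bigl(1 - \eta(x)\bigr), \qquad
R(\hat{y} = -1 \mid x) = \eta(x).
\]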

1.1.1     (3 points)

Show that the optimal Bayes risk for data point x is r∗(x) = min{η(x),α(1 − η(x))}.

1.1.2     (4 points)

Let r(x) be the asymptotic risk of the 1-NN classifier for data point x. Express r(x) in terms of α and η(x).

1.1.3     (10 points)

Prove that r(x) ≤ (1 + α)r∗(x)(1 − r∗(x)). Hint: use the fact that α > 1.

1.1.4     (3 points)
Let R be the asymptotic risk of the 1-NN classifier and R∗ be the Bayes risk. Prove that R ≤ (1 + α)R∗(1 − R∗).

1.2 k-NN classifier (20 points)

Consider a k-NN classifier (with k odd): a point is classified as positive if at least (k + 1)/2 of its k nearest neighbors are positive.

1.2.1     (3 points)
Consider drawing k points independently from a Bernoulli distribution with two outcomes, positive or negative, where the probability of a point being positive is η. Let g(η, k) be the probability that at least (k + 1)/2 out of the k points are positive. Express the asymptotic risk r(x) for a point x in terms of η(x) and the function g(·,·).
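For reference, assuming k is odd so that (k + 1)/2 is an integer threshold, g is simply a binomial tail probability:

\[
g(\eta, k) \;=\; \sum_{i=(k+1)/2}^{k} \binom{k}{i}\,\eta^{\,i}\,(1-\eta)^{\,k-i}.
\]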

1.2.2     (10 points)

Prove that r(x) = r∗(x) + (1 − 2r∗(x))g(r∗(x),k)

1.2.3     (3 points)
Using Hoeffding’s Inequality (https://en.wikipedia.org/wiki/Hoeffding_inequality), prove that:

g(r∗(x), k) ≤ exp(−2(0.5 − r∗(x))²k)          (1)
1.2.4     (4 points)
Prove that r(x) ≤ r∗(x) + (1 − 2r∗(x)) exp(−2(0.5 − r∗(x))²k). Hint: you should use the inequality in Eq. (1) above. Note that from this result, you can see that the asymptotic risk of the k-NN classifier converges to the Bayes risk as k goes to infinity.

2          Question 2 – Implementation (60 points)
2.1     Implementation (20 points)

Implement a k-Nearest Neighbor classifier. Write a Python function with the signature:

yhat, Idxs = knn_classifier(Xtrain, ytrain, Xtest, k)

where

Inputs:

•   Xtrain: a two dimensional Numpy array of size n × d, where n is the number of training data points, and d the dimension of the feature vectors.

•   ytrain: a Numpy vector of length n. ytrain[i] is the categorical label of the data point Xtrain[i,:], with ytrain[i] ∈ {0, 1, ..., c − 1}. You can assume that the number of classes c is the maximum entry of ytrain plus 1.

•   Xtest: a two dimensional Numpy array of size m × d, where m is the number of test data points, and d the dimension of the feature vectors.

•   k: the number of nearest neighbors to use for classification

Outputs:

•   yhat: a Numpy vector of length m containing the predicted labels for the test data

•   Idxs: a Numpy array of size m × k for the indexes of the k nearest neighbor data points. Each row corresponds to a test data point.

Do not use/import sklearn.neighbors.KNeighborsClassifier

Hint: Use Euclidean distance to find the k nearest neighbors. You can use scipy.spatial.distance.cdist.
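For reference, here is a minimal sketch of one possible implementation, assuming integer labels in ytrain and using scipy.spatial.distance.cdist as suggested in the hint (an illustration, not the required submission):

import numpy as np
from scipy.spatial.distance import cdist

def knn_classifier(Xtrain, ytrain, Xtest, k):
    # Pairwise Euclidean distances between test and training points (m x n).
    dists = cdist(Xtest, Xtrain, metric='euclidean')
    # Indexes of the k nearest training points for each test point (m x k).
    Idxs = np.argsort(dists, axis=1)[:, :k]
    # Labels of those neighbors, then a majority vote per test point.
    neighbor_labels = ytrain[Idxs]
    num_classes = int(ytrain.max()) + 1
    votes = np.apply_along_axis(
        lambda row: np.bincount(row, minlength=num_classes),
        1, neighbor_labels)
    yhat = np.argmax(votes, axis=1)
    return yhat, Idxs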

2.2       Experiment and result reporting (40 points)

In this section, you will run the kNN classifier on a subset of the MNIST dataset. MNIST is a dataset of images of handwritten digits (http://yann.lecun.com/exdb/mnist/) that is widely used in the field of machine learning. It includes digits from 0 to 9.

Step 1
The first step is to load your training and test data, which are provided in the files mnist_train.csv and mnist_test.csv respectively. (Tip: To load the data from a csv file, you can use numpy.genfromtxt or csv.reader.)

Inspect your data carefully: the first row of each csv file is the csv header. Each following row corresponds to a sample. The first column holds the label of each sample, which is a number from 0 to 9. The remaining 784 columns hold the features of the sample; these 784 features are the pixel values of the corresponding original image of size 28 × 28.

Load your data to numpy arrays Xtrain,ytrain,Xtest,ytest, as integers. The feature values are in the range [0, 255]. Normalize the features of both the training and test sets by dividing them by 255, in order to convert them to float values in the range [0, 1] (e.g. Xtrain = Xtrain/255.).
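A possible loading sketch using numpy.genfromtxt (file names as given above; adjust the paths if the files live elsewhere):

import numpy as np

# Skip the csv header row; every remaining entry is an integer.
train = np.genfromtxt('mnist_train.csv', delimiter=',', skip_header=1, dtype=int)
test = np.genfromtxt('mnist_test.csv', delimiter=',', skip_header=1, dtype=int)

# First column is the label (0-9); the remaining 784 columns are pixel values.
ytrain, Xtrain = train[:, 0], train[:, 1:]
ytest, Xtest = test[:, 0], test[:, 1:]

# Normalize the pixel values from [0, 255] to [0, 1].
Xtrain = Xtrain / 255.
Xtest = Xtest / 255.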

Step 2
Visualize some of the samples to better understand your dataset. You can use the following code:

import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
for i in range(9):
    plt.subplot(3, 3, i + 1)
    plt.imshow(np.reshape(Xtrain[i], (28, 28)))
    plt.title(ytrain[i])
    plt.axis("off")
plt.tight_layout()
plt.show()

2.2.1     Question 2.2.1 (10 points)
Run the kNN classifier that you implemented in Question 2.1. Run it for different values of k = 2m + 1, where m = 0, 1, 2, 4, 8, 16, 32. Plot the accuracy on the test data against k. Does k affect the performance of the classifier? (You can use sklearn.metrics.accuracy_score.)
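One possible experiment loop, assuming knn_classifier from Section 2.1 and the arrays loaded in Step 1 (the plotting details are only a sketch):

import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

ks = [2 * m + 1 for m in [0, 1, 2, 4, 8, 16, 32]]
accs = []
for k in ks:
    yhat, _ = knn_classifier(Xtrain, ytrain, Xtest, k)
    accs.append(accuracy_score(ytest, yhat))

plt.plot(ks, accs, marker='o')
plt.xlabel('k')
plt.ylabel('test accuracy')
plt.show()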

2.2.2     Question 2.2.2 (10 points)
Run the kNN classifier again for k = 3, but now using smaller subsets of your training set. Keep the first n rows of Xtrain and ytrain. Try n = 100, 200, 400, 600, 800, 1000. Plot the accuracy on the test data against n. Does the amount of training data affect the performance of the classifier?
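A sketch of the corresponding loop, under the same assumptions as in Question 2.2.1:

ns = [100, 200, 400, 600, 800, 1000]
accs = []
for n in ns:
    yhat, _ = knn_classifier(Xtrain[:n], ytrain[:n], Xtest, 3)
    accs.append(accuracy_score(ytest, yhat))

plt.plot(ns, accs, marker='o')
plt.xlabel('number of training samples n')
plt.ylabel('test accuracy')
plt.show()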

2.2.3     Question 2.2.3 (10 points)
Now try different distance metrics. Run the kNN classifier again for k = 3, using the whole training set. Use “Manhattan” distance and report the accuracy on the test data. Is the result better than using the Euclidean distance?
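If your implementation computes distances with cdist, one way to switch metrics is to change its metric argument; scipy's name for the Manhattan (L1) distance is 'cityblock'. (Passing a metric option through knn_classifier is an assumption, since the signature in Section 2.1 does not include one.)

# Manhattan (L1) distance instead of Euclidean.
dists = cdist(Xtest, Xtrain, metric='cityblock')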

2.2.4     Question 2.2.4 (10 points)
Run the kNN classifier again for k = 5, using the whole training set and Euclidean distance. Display the 5 nearest neighbors of 3 failed cases: Choose 3 test samples that are wrongly classified. For each of them, get their 5 nearest neighbors. Display each sample and its neighbors using plt.imshow, similarly to Step 2.
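A possible plotting sketch, assuming the arrays and knn_classifier from above (the layout choices are illustrative only):

yhat, Idxs = knn_classifier(Xtrain, ytrain, Xtest, 5)
wrong = np.where(yhat != ytest)[0][:3]  # first 3 misclassified test samples

fig = plt.figure()
for row, t in enumerate(wrong):
    # Left-most column: the misclassified test sample.
    plt.subplot(3, 6, 6 * row + 1)
    plt.imshow(np.reshape(Xtest[t], (28, 28)))
    plt.title('pred %d, true %d' % (yhat[t], ytest[t]))
    plt.axis('off')
    # Remaining columns: its 5 nearest training neighbors.
    for j, idx in enumerate(Idxs[t]):
        plt.subplot(3, 6, 6 * row + j + 2)
        plt.imshow(np.reshape(Xtrain[idx], (28, 28)))
        plt.title(ytrain[idx])
        plt.axis('off')
plt.tight_layout()
plt.show()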

