CSE512 Homework 3

Question 1 – Nearest Neighbor Classifiers (40 points)
1.1      1-NN with asymmetric loss (20 points)

Suppose we want to build a binary classifier, but the cost of a false positive (predicting positive for a negative case) is much higher than the cost of a false negative (predicting negative for a positive case). One can consider an asymmetric loss function, where the cost of a false negative is 1, while the cost of a false positive is α ≥ 1. For a point x, let η(x) be the probability that x is positive.

1.1.1     (3 points)

Show that the optimal Bayes risk for data point x is r∗(x) = min{η(x),α(1 − η(x))}.

1.1.2     (4 points)

Let r(x) be the asymptotic risk of the 1-NN classifier for data point x. Express r(x) in terms of α and η(x).

1.1.3     (10 points)

Prove that r(x) ≤ (1 + α)r∗(x)(1 − r∗(x)). Hint: use α ≥ 1.

1.1.4     (3 points)

Let R be the asymptotic risk of the 1-NN classifier and R∗ be the Bayes risk. Prove that: R ≤ (1 + α)R∗(1 − R∗).

1.2 k-NN classifier (20 points)

Consider a k-NN classifier: classify a point as positive if at least (k + 1)/2 of its k nearest neighbors are positive (k is assumed to be odd).

1.2.1     (3 points)

Consider drawing k points independently from a Bernoulli distribution with two outcomes, positive or negative, where the probability of a point being positive is η. Let g(η,k) be the probability that at least (k + 1)/2 out of the k points are positive. Express the asymptotic risk r(x) for a point x in terms of η(x) and the function g(·,·).
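If you want to sanity-check the expression you derive, g(η,k) can be evaluated directly from the binomial distribution. Below is a minimal Python sketch (not part of the assignment; the function name g and the assumption that k is odd are mine):

# Minimal sketch: g(eta, k) = probability that at least (k+1)/2 of k
# independent Bernoulli(eta) draws are positive (assumes k is odd).
from math import comb

def g(eta, k):
    threshold = (k + 1) // 2
    return sum(comb(k, j) * eta**j * (1 - eta)**(k - j)
               for j in range(threshold, k + 1))

# Example: g(0.3, 3) = 3 * 0.3^2 * 0.7 + 0.3^3 = 0.216
print(g(0.3, 3))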

1.2.2      (10 points)

Prove that r(x) = r∗(x) + (1 − 2r∗(x))g(r∗(x),k)

1.2.3     (3 points)

Using Hoeffding’s Inequality (https://en.wikipedia.org/wiki/Hoeffding_inequality),

prove that:

g(r∗(x),k) ≤ exp(−2(0.5 − r∗(x))²k).    (1)
1.2.4     (4 points)

Prove that: r(x) ≤ r∗(x) + exp(−2(0.5 − r∗(x))²k). Hint: you should use the above inequality, Eq. (1). Note that from this result, you can see that the asymptotic risk of the k-NN classifier approaches the Bayes risk as k goes to infinity.

Question 2 – Implementation of Logistic Regression Classifier for k classes (60 points + 10 bonus)

In this question, you will implement Logistic Regression using Stochastic Gradient Descent (SGD) optimization. Suppose the training data is {(X^1,Y^1), ..., (X^n,Y^n)}, where X^i is a column vector of d dimensions and Y^i is the target label. For a column vector X, let X̄ denote [X; 1], the vector obtained by appending 1 to the end of X. θ denotes the set of parameters θ_1, θ_2, ..., θ_{k−1}. Logistic regression for k classes assumes the following probability functions:

P(Y = c | X; θ) = exp(θ_c^⊤ X̄) / (1 + Σ_{j=1}^{k−1} exp(θ_j^⊤ X̄)),   for c = 1, ..., k − 1,    (2)

P(Y = k | X; θ) = 1 / (1 + Σ_{j=1}^{k−1} exp(θ_j^⊤ X̄)).    (3)

Logistic regression minimizes the negative average conditional log-likelihood:

L(θ) = −(1/n) Σ_{i=1}^{n} log P(Y^i | X^i; θ).    (4)
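To make Eqs. (2)–(4) concrete, here is a minimal Python sketch; the matrix Theta (whose rows are θ_1, ..., θ_{k−1}), the helper names, and the use of 1-based labels in {1, ..., k} are assumptions for illustration, not part of the assignment:

# Minimal sketch of Eqs. (2)-(4).
# Theta: (k-1) x (d+1) parameter matrix; Xbar = [X; 1] is a (d+1)-vector.
import numpy as np

def class_probabilities(Theta, Xbar):
    # Returns [P(Y=1|X;theta), ..., P(Y=k|X;theta)] as in Eqs. (2)-(3).
    scores = Theta @ Xbar                 # theta_c^T Xbar for c = 1, ..., k-1
    scores = np.append(scores, 0.0)       # class k has an implicit score of 0
    scores -= scores.max()                # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def loss(Theta, X, Y):
    # Average negative conditional log-likelihood L(theta) from Eq. (4).
    # X: n x d data matrix, Y: length-n array of labels in {1, ..., k}.
    n = X.shape[0]
    Xbar = np.hstack([X, np.ones((n, 1))])
    total = 0.0
    for i in range(n):
        p = class_probabilities(Theta, Xbar[i])
        total -= np.log(p[Y[i] - 1])      # labels are 1-based
    return total / n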

To minimize this loss function, we can use gradient descent:

θ_c ← θ_c − η ∂L(θ)/∂θ_c,   for c = 1, ..., k − 1,    (5)

where

∂L(θ)/∂θ_c = (1/n) Σ_{i=1}^{n} ∂/∂θ_c [−log P(Y^i | X^i; θ)].    (6)

This gradient is computed by enumerating over all training data. It turns out that this gradient can be approximated using a batch of training data. Suppose B is a subset of {1, 2, ..., n}:

∂L(θ)/∂θ_c ≈ (1/|B|) Σ_{i∈B} ∂/∂θ_c [−log P(Y^i | X^i; θ)].    (7)
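A minimal sketch of the mini-batch approximation in Eq. (7); it reuses class_probabilities from the sketch above and assumes the per-sample gradient expression (P(Y = c | X^i; θ) − δ(c = Y^i)) X̄^i that Section 2.1 asks you to derive:

# Minimal sketch of Eq. (7): approximate dL/d(theta_c) on a mini-batch B.
import numpy as np

def batch_gradient(Theta, Xbar, Y, batch):
    # Theta: (k-1) x (d+1), Xbar: n x (d+1), Y: labels in {1, ..., k},
    # batch: iterable of row indices in {0, ..., n-1}.
    grad = np.zeros_like(Theta)
    for i in batch:
        p = class_probabilities(Theta, Xbar[i])          # Eqs. (2)-(3)
        for c in range(Theta.shape[0]):
            indicator = 1.0 if Y[i] == c + 1 else 0.0    # delta(c = Y^i)
            grad[c] += (p[c] - indicator) * Xbar[i]
    return grad / len(batch)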

This leads to the following stochastic gradient descent algorithm:



Algorithm 1 Stochastic gradient descent for Logistic Regression



1: Inputs: {(X^i, Y^i)}_{i=1}^n (for data), m (for batch size), η0, η1 (for step size), max epoch, δ (stopping criterion)
2: for epoch = 1, 2, ..., max epoch do
3:     η ← η0/(η1 + epoch)
4:     (i1, ..., in) = permute(1, ..., n)
5:     Divide (i1, ..., in) into batches of size m or m + 1
6:     for each batch B do
7:         Update θ using Eqs. (5) and (7)
8:     Break if L(θ_new) > (1 − δ)L(θ_old) // not much progress, terminate
9: Outputs: θ
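Putting the pieces together, Algorithm 1 could be implemented along the following lines. This is a sketch, not a reference implementation: it reuses loss and batch_gradient from the earlier sketches, initializes θ to zeros, and for simplicity lets the last batch be smaller than m rather than using batches of size m or m + 1.

# Minimal sketch of Algorithm 1 (SGD for logistic regression with k classes).
import numpy as np

def sgd_train(X, Y, k, m=16, eta0=0.1, eta1=1.0, max_epoch=1000, delta=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xbar = np.hstack([X, np.ones((n, 1))])        # append 1 to every X^i
    Theta = np.zeros((k - 1, d + 1))
    old_loss = loss(Theta, X, Y)
    for epoch in range(1, max_epoch + 1):
        eta = eta0 / (eta1 + epoch)               # step 3: decaying step size
        order = rng.permutation(n)                # step 4: shuffle the data
        for start in range(0, n, m):              # steps 5-7: mini-batch updates
            batch = order[start:start + m]
            Theta -= eta * batch_gradient(Theta, Xbar, Y, batch)  # Eqs. (5), (7)
        new_loss = loss(Theta, X, Y)
        if new_loss > (1 - delta) * old_loss:     # step 8: not much progress
            break
        old_loss = new_loss
    return Theta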



2.1     Derivation (10 points)

Prove that:

∂/∂θ_c [−log P(Y^i | X^i; θ)] = (P(Y = c | X^i; θ) − δ(c = Y^i)) X̄^i,    (8)

where δ(c = Y^i) is the indicator function, which takes the value 1 if the class c equals the ground-truth label Y^i, and 0 otherwise. Use Equation (8) to derive the gradient of the loss with respect to the parameters θ_1, θ_2, ..., θ_{k−1}.
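One way to sanity-check the derivation (it does not replace the proof) is to compare the right-hand side of Eq. (8) against a finite-difference estimate. The sketch below reuses class_probabilities from the earlier sketch and assumes 1-based labels:

# Minimal sketch: finite-difference check of the per-sample gradient in Eq. (8).
import numpy as np

def gradient_check(Theta, Xbar_i, Y_i, c, eps=1e-6):
    # Returns the max absolute difference between the analytic and numeric
    # gradients of -log P(Y^i | X^i; theta) with respect to theta_c
    # (row c of Theta corresponds to class c + 1).
    def neg_log_lik(T):
        return -np.log(class_probabilities(T, Xbar_i)[Y_i - 1])
    indicator = 1.0 if Y_i == c + 1 else 0.0
    analytic = (class_probabilities(Theta, Xbar_i)[c] - indicator) * Xbar_i
    numeric = np.zeros_like(Xbar_i, dtype=float)
    for j in range(Theta.shape[1]):
        T_plus, T_minus = Theta.copy(), Theta.copy()
        T_plus[c, j] += eps
        T_minus[c, j] -= eps
        numeric[j] = (neg_log_lik(T_plus) - neg_log_lik(T_minus)) / (2 * eps)
    return np.max(np.abs(analytic - numeric))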

2.2      Crowd Image Classification

In this part of the homework, you will work with image data. We provide the features extracted from the crowd images, so you do not need to extract features from the raw images yourself. We also provide the raw images for error analysis and visualization purposes. Your task is to classify each image into one of 4 categories. The data has already been split into train, validation, and test sets.

Dataset details (obtain the data from the Kaggle competition page):

- train set (4000 images)
- val set (2000 images)
- test set (2000 images)

2.3        Implement Logistic Regression with SGD (50 points + 10 bonus)

Your task is to implement Logistic Regression with k = 4 classes using SGD. You should use Python or Matlab for your implementation.

1.    (15 points) Run your implementation on the provided training data with max epoch = 1000, m = 16, η0 = 0.1, η1 = 1, δ = 0.00001.

(a)    Report the number of epochs that your algorithm takes before exiting.

(b)   Plot the curve showing L(θ) as a function of epoch.

(c)    What is the final value of L(θ) after the optimization?

2.    (10 points) Keep m = 16 and δ = 0.00001, and experiment with different values of η0 and η1. Can you find a pair of parameters (η0, η1) that leads to faster convergence?

(a)    Report the values of (η0,η1). How many epochs does it take? What is the final value of L(θ)?

(b)   Plot the curve showing L(θ) as a function of epoch.

3.    (10 points) Evaluate the performance on validation data

(a)    Plot L(θ) as a function of epoch. On the same plot, show two curves, one for training and one for validation data.

(b)   Plot the accuracy as a function of epoch. On the same plot, show two curves, one for training and one for validation data.

4.    (5 points) With the learned classifier:

(a) Report the confusion matrices on the validation and the training data (a small sketch for computing a confusion matrix follows this list).
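For part 4(a), here is a minimal sketch of a confusion-matrix computation; it assumes 1-based labels in {1, ..., k} and reuses class_probabilities from the earlier sketches:

# Minimal sketch: k x k confusion matrix, cm[i, j] = number of examples whose
# true class is i+1 and whose predicted class is j+1.
import numpy as np

def predict(Theta, X):
    # Predict 1-based class labels with the learned parameters.
    n = X.shape[0]
    Xbar = np.hstack([X, np.ones((n, 1))])
    return np.array([np.argmax(class_probabilities(Theta, Xbar[i])) + 1
                     for i in range(n)])

def confusion_matrix(y_true, y_pred, k=4):
    cm = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t - 1, p - 1] += 1
    return cm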
