Problem 2 Classification
For this assignment you are asked to implement both binary and multiclass classification using gradient descent to update the weights and bias in each iteration. PLEASE EDIT bm_classify.py ONLY. Any other changes that cause system issues may lead to 0 points for this part.
Q2.1 Binary Classification - Perceptron vs. Logistic (28 points)
In lecture we discussed how to do binary classification with different loss functions. Now it is your turn to implement the perceptron loss and the logistic loss for binary classification. In this problem you are given a training set $D = \{(x_n, y_n)\}_{n=1}^{N}$ with $y_i \in \{0, 1\}\ \forall i = 1 \ldots N$. Please note that the labels are not +1 or -1 as we discussed in lecture, so think carefully before you apply the formulas to this question. Your task is to learn a model $w^T x + b$ that minimizes the given loss. Please derive the gradients of the two loss functions yourself and apply average gradient descent to update $w, b$ in each iteration. For the perceptron loss, we define $Z = y_n(w^T x_n + b) > 0$ as correctly classified data.
(8 points) TODO 1 For the perceptron loss, that is, find the minimizer of
$$F(w, b) = \sum_{n=1}^{N} L_{\text{perceptron}}\big(y_n(w^T x_n + b)\big) = \sum_{n=1}^{N} \max\big(0, -y_n(w^T x_n + b)\big)$$
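For concreteness, here is a minimal numpy sketch of one average-gradient-descent step for this loss. It is not the official solution; the function name, the learning-rate argument `lr`, and the {0,1}-to-{-1,+1} label mapping are our own choices, not part of the bm_classify.py template.

```python
import numpy as np

def perceptron_step(w, b, X, y, lr):
    """One average-gradient-descent step for the perceptron loss (illustrative sketch).

    Assumes X is (N, D), w is (D,), b is a scalar, and y uses {0, 1} labels.
    """
    y_pm = np.where(y == 1, 1.0, -1.0)      # map {0,1} labels to {-1,+1} before applying the formula
    margin = y_pm * (X @ w + b)             # y_n (w^T x_n + b) for every point
    mask = (margin <= 0).astype(float)      # only misclassified points contribute to the gradient
    grad_w = -(mask * y_pm) @ X / len(y)    # average gradient w.r.t. w
    grad_b = -(mask * y_pm).sum() / len(y)  # average gradient w.r.t. b
    return w - lr * grad_w, b - lr * grad_b
```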
(8 points) TODO 2 For the logistic loss, that is, find the minimizer of
$$F(w, b) = \sum_{n=1}^{N} L_{\text{logistic}}\big(y_n(w^T x_n + b)\big) = \sum_{n=1}^{N} \ln\big(1 + e^{-y_n(w^T x_n + b)}\big)$$
(4 points) TODO 3 You will also find it convenient to use the sigmoid function $\sigma(z) = (1 + e^{-z})^{-1}$ for the logistic loss, so please complete it. You can use this function in TODO 2.
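Putting TODO 2 and TODO 3 together, one average-gradient-descent step for the logistic loss might look roughly like the sketch below. Again, the names and signatures are ours, not the template's, and the label mapping is the same assumption as in the perceptron sketch above.

```python
import numpy as np

def sigmoid(z):
    """Sketch of TODO 3: sigma(z) = (1 + e^{-z})^{-1}."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_step(w, b, X, y, lr):
    """One average-gradient-descent step for the logistic loss (illustrative sketch)."""
    y_pm = np.where(y == 1, 1.0, -1.0)      # map {0,1} labels to {-1,+1}
    z = y_pm * (X @ w + b)                  # y_n (w^T x_n + b)
    s = sigmoid(-z)                         # d/dz ln(1 + e^{-z}) = -sigma(-z)
    grad_w = -(s * y_pm) @ X / len(y)       # average gradient w.r.t. w
    grad_b = -(s * y_pm).sum() / len(y)     # average gradient w.r.t. b
    return w - lr * grad_w, b - lr * grad_b
```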
(4 points each) TODO 4, TODO 5 After you learn the models, how do you know they are good enough? The intuitive way is to make some predictions and see whether the predicted results are correct. Here we want you to complete the prediction functions: roughly, check whether something is greater than 0, or whether something passed through the sigmoid function is greater than 0.5. You may find out an interesting fact here.
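A possible shape for these prediction functions, as a self-contained sketch (the real signatures live in bm_classify.py):

```python
import numpy as np

def binary_predict(X, w, b, loss="perceptron"):
    """Sketch of TODO 4/5: predict {0, 1} labels from a learned w, b (our own signature)."""
    scores = X @ w + b
    if loss == "logistic":
        preds = 1.0 / (1.0 + np.exp(-scores)) > 0.5   # sigma(s) > 0.5 exactly when s > 0
    else:
        preds = scores > 0
    return preds.astype(int)                          # back to {0, 1} labels
```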
After you finish the 5 TODOs above in bm_classify.py, you can run binary.sh to test it. If your code is programmed correctly, you should see binary.out as an output file containing the training and testing accuracies of the two loss functions on three datasets. You will see how similar the perceptron loss and the logistic loss are. Quite interesting, isn't it? Can you figure out why this happens? We will leave this open question to you.
Two of the datasets you are going to use for binary classification:
Synthetic data: [scatter plot]
Two Moon data: [scatter plot]
Q2.2 Multiclass Classification - SGD vs. GD (32 points)
Well done! Ready to take our next challenge? Let's get into multiclass classification. In this question you are going to build a model to classify data into more than just two classes. You are also going to implement both SGD and GD for multiclass classification and compare the performance of the two approaches. The training dataset is similar to that of Q2.1, but $y_i \in \{0, 1, \ldots, C-1\}\ \forall i = 1 \ldots N$. Your task is to learn models for multiclass classification by minimizing the logistic loss.
Here is a short review of SGD.
From the lecture we know the multiclass logistic loss is
$$F(W) = \sum_{n=1}^{N} \ln\Big(1 + \sum_{k \neq y_n} e^{(w_k - w_{y_n})^T x_n}\Big).$$
Here, $w_k$ is the weight vector for class $k$ and $w_{y_n}$ is the weight vector for class $y_n$, where $k$ ranges over $[0, C-1]$. Now we try to apply SGD. First we randomly pick a data point $x_n$ and minimize the logistic loss
$$g(W) = \ln\Big(1 + \sum_{k \neq y_n} e^{(w_k - w_{y_n})^T x_n}\Big).$$
Then we find the derivative $\nabla_W g(W)$, which is a $C \times D$ matrix.
Let's look at each row k.
If $k \neq y_n$:
$$\nabla_{w_k} g(W) = \frac{e^{(w_k - w_{y_n})^T x_n}}{1 + \sum_{k' \neq y_n} e^{(w_{k'} - w_{y_n})^T x_n}}\, x_n^T = P(k \mid x_n; W)\, x_n^T$$
else:
$$\nabla_{w_k} g(W) = \frac{-\sum_{k' \neq y_n} e^{(w_{k'} - w_{y_n})^T x_n}}{1 + \sum_{k' \neq y_n} e^{(w_{k'} - w_{y_n})^T x_n}}\, x_n^T = \big(P(y_n \mid x_n; W) - 1\big)\, x_n^T$$
where $P$ is the softmax function.
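To see why these fractions are exactly the softmax probabilities, multiply the numerator and denominator by $e^{w_{y_n}^T x_n}$ (a one-line check, not part of the original handout):

$$P(k \mid x_n; W) = \frac{e^{(w_k - w_{y_n})^T x_n}}{1 + \sum_{k' \neq y_n} e^{(w_{k'} - w_{y_n})^T x_n}} = \frac{e^{w_k^T x_n}}{\sum_{k'=0}^{C-1} e^{w_{k'}^T x_n}}$$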
In the end, our update for $W$ is
$$W \leftarrow W - \eta \begin{bmatrix} P(y = 0 \mid x_n; W) \\ \vdots \\ P(y = y_n \mid x_n; W) - 1 \\ \vdots \\ P(y = C-1 \mid x_n; W) \end{bmatrix} x_n^T$$
That is the whole idea of SGD with the logistic loss for multiclass classification.
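In numpy, one such SGD step could look roughly like the sketch below. The function name and argument layout are ours; the template's multiclass_train may organize $W$, $b$, and the training loop differently.

```python
import numpy as np

def multiclass_sgd_step(W, b, X, y, lr):
    """One SGD step for the multiclass logistic loss (illustrative sketch).

    Assumes W is (C, D), b is (C,), X is (N, D), and y holds labels in {0, ..., C-1}.
    """
    n = np.random.choice(X.shape[0])     # randomly pick one data point
    x_n, y_n = X[n], y[n]
    scores = W @ x_n + b                 # (C,) raw class scores
    scores -= scores.max()               # stability shift (see the Advice below)
    p = np.exp(scores)
    p /= p.sum()                         # softmax probabilities P(k | x_n; W)
    p[y_n] -= 1.0                        # subtract 1 at the true class, as in the update above
    W -= lr * np.outer(p, x_n)           # C x D rank-one update
    b -= lr * p                          # same gradient for the bias terms
    return W, b
```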
(8 points) TODO 6 Complete the SGD part in def multiclass_train, and don't forget the bias $b$. To randomly pick a data point from the dataset, you can use np.random.choice once in each iteration.
(16 points) TODO 7 Complete the GD part in def multiclass_train. Compared to SGD, GD does not randomly pick a data point $x_n$; instead, GD considers all training data points to compute the derivative. Please think about how to compute the GD update, and again we want average gradient descent. There is also a tricky point: when the dataset is large, GD will take a large amount of time. How can you reduce the time? Make sure you use vectorized numpy operations instead of nested for loops, otherwise you will not finish your test on Vocareum within the time limit. A vectorized sketch follows the hints below.
Hint 1: If you need to run a for loop $N$ times to accumulate $C \times D$ matrices, consider designing an equivalent computation as a $C \times N$ matrix multiplied by an $N \times D$ matrix.
Hint 2: You may find it useful to use a special (one-hot) representation of the labels, where each label $y_i$ is represented as a row of zeros with a single 1 in the column that corresponds to the class $y_i$. This one-hot representation should be an $N \times C$ matrix.
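Following the two hints, one vectorized full-batch step might look like this sketch (names are ours, and the per-row stability shift is explained in the Advice below):

```python
import numpy as np

def multiclass_gd_step(W, b, X, y, lr):
    """One full-batch GD step for the multiclass logistic loss, with no loop over N (sketch).

    Assumes W is (C, D), b is (C,), X is (N, D), and y holds labels in {0, ..., C-1}.
    """
    N, C = X.shape[0], W.shape[0]
    scores = X @ W.T + b                         # (N, C) raw class scores
    scores -= scores.max(axis=1, keepdims=True)  # per-row stability shift (see the Advice below)
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)            # (N, C) softmax probabilities
    Y = np.zeros((N, C))
    Y[np.arange(N), y] = 1.0                     # one-hot labels, N x C (Hint 2)
    G = (P - Y).T @ X / N                        # C x N dot N x D -> C x D average gradient (Hint 1)
    W -= lr * G
    b -= lr * (P - Y).sum(axis=0) / N            # average gradient for the biases
    return W, b
```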
Advice: To avoid numerical issues such as overflow and underflow caused by np.exp, let $x$ be an input vector to the softmax function. Use $\tilde{x} = x - \max(x)$ instead of $x$ directly as the input to the softmax function. That is, if you want to compute $f(x)_i$, compute $f(\tilde{x})_i = \frac{\exp(\tilde{x}_i)}{\sum_{j=1}^{D} \exp(\tilde{x}_j)}$ instead, which is clearly mathematically equivalent but numerically more stable.
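As a standalone illustration of this advice (a sketch; the helper name is ours):

```python
import numpy as np

def stable_softmax(x):
    """Softmax with the max-subtraction trick: compute f(x_tilde) with x_tilde = x - max(x)."""
    x_tilde = x - np.max(x)   # shifting by max(x) leaves the softmax unchanged but avoids overflow
    e = np.exp(x_tilde)
    return e / e.sum()
```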
(8 points) TODO 8 You need to complete the predict function in def multiclass_predict.
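A minimal sketch of what this prediction boils down to (the real function signature is in bm_classify.py):

```python
import numpy as np

def multiclass_predict_sketch(X, W, b):
    """Pick the class with the largest score w_k^T x + b_k for each row of X (sketch)."""
    return np.argmax(X @ W.T + b, axis=1)
```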
After you complete TODO 6, TODO 7, and TODO 8, please run multiclass.sh to test your code. If your code is programmed correctly, you should see multiclass.out as an output file containing the processing time and the training and testing accuracies of SGD and GD for each given dataset. You will see how fast SGD runs compared to GD, but its accuracy has a hard time catching up with GD's. One more open question to leave you with: is there any chance that SGD reaches the same accuracy as GD while using less time?
That is all for this assignment.
Scatter plot of a sample toy dataset for multiclass classification: [scatter plot]