Download: Codebase
In this problem, you are asked to apply linear regression and logistic regression to binary classification. In particular, this problem shows that linear regression is a bad model for classification problems.
We consider a binary classification problem, where the input is a two-dimensional vector and the output is a label in {0,1}. In other words, $\mathbf{x} \in \mathbb{R}^2$ and $t \in \{0, 1\}$. Specifically, we train our classifiers on the following two datasets:
Dataset A:
In this dataset, positive samples are generated by a bivariate normal distribution , where . Negative samples are generated by another , where
. The covariance is the same as the positive samples.
We have 400 samples in total, where 200 are positive and 200 are negative. The dataset is plotted in the left panel below.
Dataset B: We now construct a new dataset by shifting half of the positive (blue) samples to the upper right as shown in the right panel. In other words, the positive samples are generated by
with equal probability, where .
[Figure: left panel, Dataset A; right panel, Dataset B]
Questions:
For each of Dataset A or Dataset B:
Train a classifier by thresholding a linear regression model. In other words, treat the target 0/1 labels as real numbers, and classify a sample as positive if the predicted value is greater than or equal to 0.5.
Train a logistic regression classifier on the same data.
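The two classifiers can be sketched as follows. This is only a minimal illustration, not the provided codebase: the means, the covariance, the learning rate, and all function names (make_dataset, fit_linear, fit_logistic, accuracy) are assumptions made for this example, and the logistic regression is trained here with plain full-batch gradient descent.

```python
import numpy as np

def make_dataset(rng, n_per_class=200, shift_half=False):
    # All means and the covariance are hypothetical placeholders,
    # NOT the values used in the assignment.
    cov = np.eye(2)
    mu_pos, mu_neg = np.array([1.5, 1.5]), np.array([-1.5, -1.5])
    mu_far = np.array([10.0, 10.0])
    X_pos = rng.multivariate_normal(mu_pos, cov, n_per_class)
    if shift_half:
        # Dataset B: move half of the positive samples to the upper right.
        X_pos[: n_per_class // 2] += mu_far - mu_pos
    X_neg = rng.multivariate_normal(mu_neg, cov, n_per_class)
    X = np.vstack([X_pos, X_neg])
    t = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
    return X, t

def add_bias(X):
    # Append a constant feature so the model has an intercept.
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_linear(X, t):
    # Least-squares fit that treats the 0/1 labels as real numbers.
    w, *_ = np.linalg.lstsq(add_bias(X), t, rcond=None)
    return w

def fit_logistic(X, t, lr=0.05, n_iters=5000):
    # Full-batch gradient descent on the mean cross-entropy loss.
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        z = np.clip(Xb @ w, -30.0, 30.0)   # keep exp() in a safe range
        y = 1.0 / (1.0 + np.exp(-z))
        w -= lr * Xb.T @ (y - t) / len(t)
    return w

def accuracy(w, X, t, threshold):
    # Linear regression: threshold the output at 0.5.
    # Logistic regression: threshold the logit at 0 (probability 0.5).
    return float(np.mean((add_bias(X) @ w >= threshold) == t))

rng = np.random.default_rng(0)
results = {}
for name, shift in [("A", False), ("B", True)]:
    X, t = make_dataset(rng, shift_half=shift)
    results[name] = (accuracy(fit_linear(X, t), X, t, 0.5),
                     accuracy(fit_logistic(X, t), X, t, 0.0))
    print(name, "linear:", results[name][0], "logistic:", results[name][1])
```

Under these assumed parameters, the shifted positive samples in Dataset B pull the least-squares fit toward themselves, so the 0.5 threshold moves away from the boundary between the two nearby clusters, whereas the logistic loss is barely affected by points that are already classified with high confidence.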
Problem 2 [50%]
Download: Codebase
In this coding problem, we will implement softmax regression for multi-class classification using the MNIST dataset.
Dataset
First, download the datasets from the link above. You need to unzip the .gz file, either by double-clicking it or with a command such as gunzip -k file.gz.
The dataset contains 60K training samples, and 10K test samples. Again, we split 10K from the training samples for validation. In other words, we have 50K training samples, 10K validation samples, and 10K test samples. The target label is among {0, 1, …, 9}.
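The split can be sketched as below. Which 10K samples the codebase actually holds out is not stated here, so taking the last 10K is an assumption, and split_train_val is a hypothetical helper name.

```python
import numpy as np

def split_train_val(X, t, n_val=10000):
    """Hold out the last n_val training samples for validation (assumed convention)."""
    return (X[:-n_val], t[:-n_val]), (X[-n_val:], t[-n_val:])
```

With the 60K MNIST training samples, this yields 50K training samples and 10K validation samples; the 10K test samples stay untouched until the final evaluation.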
Algorithm
We will implement stochastic gradient descent (SGD) for the cross-entropy loss of softmax as the learning algorithm. The measure of success will be the accuracy (i.e., the fraction of correct predictions).
The general framework for this coding assignment is the same as SGD for linear regression, so you may re-use most of the code. However, you will need to change the computation of the output, the loss function, the measure of success, and the gradient where needed.
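As a rough sketch of the pieces that change relative to linear regression, one mini-batch update could look like the following. The function names, the weight layout (a d-by-K matrix W with any bias folded into the features), the batch size, and the learning rate are all assumptions for illustration and may differ from the codebase. The log-probabilities are computed after subtracting each row's maximum, a standard way to keep exp from overflowing.

```python
import numpy as np

def sgd_step(W, X_batch, t_batch, lr=0.1):
    """One mini-batch update. W: (d, K) floats, X_batch: (B, d), t_batch: (B,) ints."""
    B = X_batch.shape[0]
    z = X_batch @ W
    z = z - z.max(axis=1, keepdims=True)             # shift each row for stability
    log_y = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    loss = -log_y[np.arange(B), t_batch].mean()      # mean cross-entropy
    grad_z = np.exp(log_y)                           # softmax output y
    grad_z[np.arange(B), t_batch] -= 1.0             # y - one_hot(t)
    W -= lr * (X_batch.T @ grad_z) / B               # in-place gradient step
    return float(loss)

def accuracy(W, X, t):
    # Fraction of samples whose highest-scoring class equals the target.
    return float(np.mean(np.argmax(X @ W, axis=1) == t))

def train_one_epoch(W, X, t, rng, batch_size=10, lr=0.1):
    # Visit the samples in a fresh random order, one mini-batch at a time.
    order = rng.permutation(len(X))
    losses = []
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        losses.append(sgd_step(W, X[idx], t[idx], lr))
    return float(np.mean(losses))
```

Note that sgd_step modifies W in place, so W must be a floating-point array, and the targets must be integer class indices.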
Implementation trick
For softmax classification, you may encounter numerical overflow if you directly follow the equation given in the lecture.
The observation is that the exp function increases very fast with its input, so exp(z) very soon overflows to infinity, and the normalization then yields NaN (not a number).
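The trick is to subtract the maximum value $z_{\max}$ from every $z_i$.
In other words, we compute
$\tilde{z}_i = z_i - z_{\max}$, where $z_{\max} = \max_j z_j$, and then we have
$$y_i = \frac{\exp(\tilde{z}_i)}{\sum_j \exp(\tilde{z}_j)} = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$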
Note that the gradient is computed with y, and since subtracting a constant before softmax doesn’t affect y, it doesn’t affect the gradient either.
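A small sketch of the difference, assuming NumPy and a single logit vector; the names softmax_naive and softmax_stable are illustrative and not part of the codebase:

```python
import numpy as np

def softmax_naive(z):
    # Direct use of the definition; exp overflows for large inputs.
    e = np.exp(z)
    return e / e.sum()

def softmax_stable(z):
    # Subtract the maximum first, so the largest exponent is exp(0) = 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z_small = np.array([1.0, 2.0, 3.0])
z_large = z_small + 999.0      # same logits shifted by a constant

# Silence the overflow/invalid warnings that the naive version triggers.
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(z_large)

stable_small = softmax_stable(z_small)
stable_large = softmax_stable(z_large)
```

Here naive_large contains only NaN values because every exponential overflows, while stable_large equals stable_small, which illustrates that shifting all logits by a constant leaves y unchanged.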
Without changing the default hyperparameters, we report three numbers:
1. The number of epochs that yields the best validation performance,
2. The validation performance (accuracy) in that epoch, and
3. The test performance (accuracy) in that epoch.
and two plots:
1. The learning curve of the training cross-entropy loss, and
2. The learning curve of the validation accuracy.
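The selection rule behind the three numbers can be sketched as follows: the epoch is chosen by validation accuracy only, and the test accuracy is read off at that same epoch. The helper name, the 1-based epoch numbering, and the accuracy histories below are made up purely for illustration and are not real results.

```python
def select_best_epoch(val_accs, test_accs):
    """Return (epoch, val_acc, test_acc) at the epoch with the highest validation accuracy."""
    best = max(range(len(val_accs)), key=lambda i: val_accs[i])
    # Epochs are counted from 1 here (an assumption about the reporting convention).
    return best + 1, val_accs[best], test_accs[best]

# Toy histories, NOT real results.
val_history = [0.80, 0.88, 0.91, 0.90]
test_history = [0.79, 0.87, 0.90, 0.91]
epoch, val_acc, test_acc = select_best_epoch(val_history, test_history)
```

In this toy example the last epoch has a higher test accuracy, but it is not selected, because the test set must not influence the choice of epoch.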
Ask one meaningful scientific question yourself, design your experimental protocol, present results, and draw a conclusion.
Note:
A scientific question means that we can give a verifiable hypothesis that can be either confirmed or rejected.
Example of a scientific question: Is the learned classifier for this dataset better than majority guess?
Your hypothesis could be either yes or no, and it can be verified by experiments.
Example of a non-scientific question: Does the learned classifier become better if I have a superpower? My hypothesis could be either yes or no, but it cannot be verified by any experiment. I don’t know what a superpower is, so I can say yes, or I can also say no. Neither answer can be shown to be wrong, or even correct.
A meaningful scientific question means that you’ll learn something from the experiment. Of course, what is meaningful is itself subjective. For this coding assignment, a scientific question is considered meaningful as long as the student would learn something from it, or verify some result we mentioned in lectures.
Example of a meaningful scientific question: How does linear regression perform for classification on this dataset? Doing this experiment gives us first-hand experience of why we should not use regression models for classification. However, you cannot submit this question as your solution. You need to ask your own scientific question that interests you and/or inspires others.
Example of a not-so-meaningful scientific question: Is the learned classifier for this dataset better than majority guess? Ok, this question is considered scientific, but too trivial for us, although it is not necessarily trivial for those who don’t know machine learning at all. [Again this shows the subjectivity of evaluating the significance of science.]
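For reference, the majority-guess baseline mentioned in the examples above can be sketched as below, assuming it means always predicting the most frequent training label; majority_guess_accuracy is a hypothetical helper name.

```python
from collections import Counter

def majority_guess_accuracy(train_labels, eval_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for t in eval_labels if t == majority)
    return correct / len(eval_labels)
```

Comparing a learned classifier's accuracy against this number is one way to turn the example question into a hypothesis that an experiment can confirm or reject.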