$30
1. Comparing classifiers.
In lectures, we learn different classifiers. This question is compare them on two datasets. Python users, please feel free to use Scikit-learn, which is a commonly-used and powerful Python library with various machine learning tools. But you can also use other similar libraries in other languages of your choice to perform the tasks.
Part One (Divorce classification/prediction).
This dataset is about participants who completed the personal information form and a divorce predictors scale. The data is a modified version of the publicly available at https://archive.ics.uci. edu/ml/datasets/Divorce+Predictors+data+set (by injecting noise so you will not get the exactly same results as on UCI website). The dataset marriage.csv is contained in the homework folder. There are 170 participants and 54 attributes (or predictor variables) that are all real-valued. The last column of the CSV file is label y (1 means “divorce”, 0 means “no divorce”). Each column is for one feature (predictor variable), and each row is a sample (participant). A detailed explanation for each feature (predictor variable) can be found at the website link above. Our goal is to build a classifier using training data, such that given a test sample, we can classify (or essentially predict) whether its label is 0 (“no divorce”) or 1 (“divorce”).
We are going to compare the following classifiers (Naive Bayes, Logistic Regression, and KNN). Use the first 80% data for training and the remaining 20% for testing. If you use scikit-learn you can use train test split to split the dataset.
Remark: Please note that, here, for Naive Bayes, this means that we have to estimate the variance for each individual feature from training data. When estimating the variance, if the variance is zero to close to zero (meaning that there is very little variability in the feature), you can set the variance to be a small number, e.g.,. We do not want to have include zero or nearly variance in Naive Bayes. This tip holds for both Part One and Part Two of this question.
Report testing accuracy for each of the three classifiers. Comment on their performance: which performs the best and make a guess why they perform the best in this setting.
Now perform PCA to project the data into two-dimensional space. Build the classifiers(Naive Bayes, Logistic Regression, and KNN) using the two-dimensional PCA results. Plot the data points and decision boundary of each classifier in the two-dimensional space. Comment on the difference between the decision boundary for the three classifiers. Please clearly represent the data points with different labels using different colors.
Part Two (Handwritten digits classification). (35 points)
This question is to compare different classifiers and their performance for multi-class classifications on the complete MNIST dataset at http://yann.lecun.com/exdb/mnist/. You can find the data file mnist 10digits.mat in the homework folder. The MNIST database of handwritten digits has a training set of 60,000 examples, and a test set of 10,000 examples. We will compare KNN, logistic regression, SVM, kernel SVM, and neural networks.
We suggest you to “standardize” the features before training the classifiers, by dividing the values of the features by 255 (thus map the range of the features from [0, 255] to [0, 1]).
You may adjust the number of neighbors K used in KNN to have a reasonable result (you may use cross validation but it is not required; any reasonable tuning to get good result is acceptable).
You may use a neural networks function neural network with hidden layer sizes = (20, 10). • For kernel SVM, you may use radial basis function kernel, and a heuristic called “median trick”: choose the parameter of the kernel K(x,x0) = exp{−kx − x0k2/(2σ2)}. Choose the bandwidth
as σ = pM/2 where M = the median offor pairs of training samples. Here you can randomly choose m0 = 1000 samples from training data to use for the “median trick”[1].
For KNN and SVM, you can randomly downsample the training data to size m = 5000, to improve computation efficiency.
Train the classifiers on training dataset and evaluate on the test dataset.
(25 points) Report confusion matrix, precision, recall, and F-1 score for each of the classifiers. Forprecision, recall, and F-1 score of each classifier, we will need to report these for each of the digits. So you can create a table for this. For this question, each of the 5 classifier, KNN, logistic regression, SVM, kernel SVM, and neural networks, accounts for 10 points.
(10 points) Comment on the performance of the classifier and give your explanation why some ofthem perform better than the others.
2. Naive Bayes for spam filtering.
In this problem, we will use the Naive Bayes algorithm to fit a spam filter by hand. This will enhance your understanding to Bayes classifier and build intuition. This question does not involve any programming but only derivation and hand calculation.
Spam filters are used in all email services to classify received emails as “Spam” or “Not Spam”. A simple approach involves maintaining a vocabulary of words that commonly occur in “Spam” emails and classifying an email as “Spam” if the number of words from the dictionary that are present in the email is over a certain threshold. We are given the vocabulary consists of 15 words
V = {secret, offer, low, price, valued, customer, today, dollar, million, sports, is, for, play, healthy, pizza}.
We will use Vi to represent the ith word in V . As our training dataset, we are also given 3 example spam messages,
million dollar offer
secret offer today
secret is secret and 4 example non-spam messages
low price for valued customer
play secret sports today
sports is healthy
low price pizza
Recall that the Naive Bayes classifier assumes the probability of an input depends on its input feature.
The feature for each sample is defined as and the class of the ith sample is y(i). In our case the length of the input vector is d = 15, which is equal to the number of words in the vocabulary V . Each entry is equal to the number of times word Vj occurs in the i-th message.
(5 points) Calculate class prior P(y = 0) and P(y = 1) from the training data, where y = 0 corresponds to spam messages, and y = 1 corresponds to non-spam messages. Note that these class prior essentially corresponds to the frequency of each class in the training sample. Write down the feature vectors for each spam and non-spam messages.
(15 points) In the Naive Bayes model, assuming the keywords are independent of each other (this is a simplification), the likelihood of a sentence with its feature vector x given a class c is given by
d
P(x|y = c) = Y θc,kxk , c = {0,1}
k=1
where 0 ≤ θc,k ≤ 1 is the probability of word k appearing in class c, which satisfies
d
X
θc,k = 1, c = {0,1}.
k=1
Given this, the complete log-likelihood function for our training data is given by
m d
`(θ0,1,...,θ0,d,θ1,1,...,θ1,d) = XXx(ki) logθy(i),k
i=1 k=1
(In this example, m = 7.) Calculate the maximum likelihood estimates of θ0,1, θ0,7, θ1,1, θ1,15 by maximizing the log-likelihood function above.
(Hint: We are solving a constrained maximization problem and you will need to introduce Lagrangian multipliers and consider the Lagrangian function.)
(15 points) Given a test message “today is secret”, using the Naive Bayes classier that you have trained in Part (a)-(b), to calculate the posterior and decide whether it is spam or not spam.
3. Neural networks. (Bonus:
Consider a simple two-layer network in the lecture slides. Given m training data (xi,yi), i = 1,...,m, the cost function used to training the neural networks
where σ(x) = 1/(1 + e−x) is the sigmoid function, zi is a two-dimensional vector such that), and ). Show the that the gradient is given by
,
where ui = wTzi. Also show the gradient of `(w,α,β) with respect to α and β and write down their expression.
[1] Garreau, Damien, Wittawat Jitkrittum, and Motonobu Kanagawa. ”Large sample analysis of the median heuristic.” arXiv preprint arXiv:1707.07269 (2017).