EE5907 Programming Assignment


Your submission (a single zip folder) should include the following:

1.   The PDF file of a well-written, concise project report. The report should NOT be longer than 10 pages (font size 12, single-spaced, Arial). The report filename MUST be "[name_on_matric_card]_[matric_number]_report.pdf".

2.  Your source code folder.

3.  Readme file containing instructions to run your code. This readme file MUST be inside your source code folder.

4.  Please don’t include project data in your zip folder.

Before you start, take note of the following:

1.   You may discuss the assignment with your classmates, but you must write the code completely on your own. Plagiarism will be severely punished.

2.   You can use MATLAB or Python.

3.   The data (spamData.mat) can be downloaded from the LumiNUS workbin. The data is in MATLAB format, which can be quite easily read in Python with the right package. If you can’t figure out the right package, you probably should be using MATLAB :)

4.   For all the questions, there are publicly available software libraries that implement some versions of these classifiers. However, implementing the algorithms yourself will help you understand the theory better. Therefore, you are expected to implement the classifiers yourself.

5.   The evaluation criteria include the organization and clarity of the report, the correctness of the implementation, and the performance of the classifiers.

6.   Please be considerate to your GA. Be as clear as possible in your submission, e.g., by providing a clear readme and helpful comments in your code. You are more likely to get a better grade if your GA can understand your code and report!

Acknowledgement: This assignment is a variation of a problem from Kevin Murphy’s “Machine Learning: A Probabilistic Perspective” textbook.

Data Description
The data is an email spam dataset consisting of 4601 email messages, each described by 57 features. Feature descriptions are found in this link. We have divided the data into a training set (3065 emails) and a test set (1536 emails) with accompanying labels (1 = spam, 0 = not spam).
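If you choose Python, a minimal loading sketch using scipy.io.loadmat is shown below. The variable names Xtrain, ytrain, Xtest and ytest inside spamData.mat are an assumption here; inspect the keys of the loaded dictionary and adjust if your copy uses different names.

import numpy as np
from scipy.io import loadmat

# Load the .mat file into a dict of variable name -> array.
data = loadmat("spamData.mat")

# Assumed variable names; check data.keys() if these differ in your copy.
Xtrain = data["Xtrain"].astype(float)   # expected shape (3065, 57)
ytrain = data["ytrain"].ravel()         # expected shape (3065,), 1 = spam, 0 = not spam
Xtest = data["Xtest"].astype(float)     # expected shape (1536, 57)
ytest = data["ytest"].ravel()           # expected shape (1536,)

print(Xtrain.shape, ytrain.shape, Xtest.shape, ytest.shape)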

Data Processing
One can try different preprocessing of the features. Consider the following two options separately (a short code sketch of both is given after the list):

(a)   log-transform: transform each feature using log(x_ij + 0.1) (assume natural log)

(b)  binarization: binarize each feature using I(x_ij > 0). In other words, if a feature value is greater than 0, it is set to 1; if it is less than or equal to 0, it is set to 0.
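Continuing from the loading sketch above, the two preprocessing options can be implemented in a few lines; np.log is the natural logarithm, matching the assignment.

# (a) log-transform: log(x_ij + 0.1), applied elementwise.
Xtrain_log = np.log(Xtrain + 0.1)
Xtest_log = np.log(Xtest + 0.1)

# (b) binarization: I(x_ij > 0).
Xtrain_bin = (Xtrain > 0).astype(int)
Xtest_bin = (Xtest > 0).astype(int)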

Q1. Beta-binomial Naive Bayes 
Fit a Beta-Binomial naive Bayes classifier on the binarized data from the Data Processing section. Since there are a lot of spam and non-spam emails, you do not need to assume any prior on the class label. In other words, the class label prior λ can be estimated using ML and you can use λ_ML as a plug-in estimator for testing.

On the other hand, you should assume a prior Beta(α, α) on the feature distribution (note that the two hyperparameters for the Beta prior are set to be the same). For each value of α = {0, 0.5, 1, 1.5, 2, ..., 100}, fit the classifier on the training data and compute its training and test error rates (i.e., the percentage of emails classified wrongly). For the features (i.e., when computing p(x|y)), please use Bayesian (i.e., posterior predictive) training and testing (see week 3 lecture notes on “Predicting Target Class of Test Data x̃ Using Posterior Predictive Distribution”). A short code sketch is given after the list below.

Make sure you include at least the following in your report:

•     Plots of training and test error rates versus α

•     What do you observe about the training and test errors as α changes?

•     Training and testing error rates for α = 1, 10 and 100.
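The sketch below is one way to implement the posterior predictive computation under the Beta(α, α) prior with an ML plug-in class prior, using the binarized arrays from the Data Processing sketch. The function and variable names (fit_predict_beta_nb, Xtrain_bin, etc.) are illustrative, not prescribed by the assignment.

import numpy as np

def fit_predict_beta_nb(Xtr, ytr, X, alpha):
    """Predict labels for X with Beta-Bernoulli naive Bayes (posterior predictive)."""
    log_post = []
    for c in (0, 1):
        Xc = Xtr[ytr == c]                              # training emails of class c
        Nc = Xc.shape[0]
        # Posterior predictive P(x_j = 1 | y = c, D) under a Beta(alpha, alpha) prior.
        theta = (Xc.sum(axis=0) + alpha) / (Nc + 2.0 * alpha)
        theta = np.clip(theta, 1e-12, 1 - 1e-12)        # guard against log(0) when alpha = 0
        prior = Nc / Xtr.shape[0]                       # lambda_ML for the class label
        ll = X @ np.log(theta) + (1 - X) @ np.log(1 - theta)
        log_post.append(np.log(prior) + ll)
    return np.argmax(np.stack(log_post, axis=1), axis=1)

alphas = np.arange(0, 100.5, 0.5)                       # 0, 0.5, 1, ..., 100
q1_train_err = [np.mean(fit_predict_beta_nb(Xtrain_bin, ytrain, Xtrain_bin, a) != ytrain)
                for a in alphas]
q1_test_err = [np.mean(fit_predict_beta_nb(Xtrain_bin, ytrain, Xtest_bin, a) != ytest)
               for a in alphas]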

Q2. Gaussian Naive Bayes 
Fit a Gaussian naive Bayes classifier on the log-transformed data from the Data Processing section. Since there are a lot of spam and non-spam emails, you do not need to assume any prior on the class label. In other words, the class label prior λ can be estimated using ML and you can use λ_ML as a plug-in estimator for testing.

For this exercise, just use maximum likelihood to estimate the class-conditional mean and variance of each feature, and use the ML estimates as plug-in estimators for testing (see week 3 lecture notes on “ML estimation of µ, σ²” and “Predicting Target Class of Test Data x̃” for Strategies 1 and 2). A short code sketch is given after the list below. Make sure you include the following in your report:

• Training and testing error rates for the log-transformed data.
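A minimal sketch of the ML plug-in approach, assuming the log-transformed arrays from the Data Processing sketch. Note that np.var divides by N by default, which is the ML estimate of the variance.

import numpy as np

def gaussian_nb_predict(Xtr, ytr, X):
    """Predict labels for X with Gaussian naive Bayes using ML plug-in estimates."""
    log_post = []
    for c in (0, 1):
        Xc = Xtr[ytr == c]
        mu = Xc.mean(axis=0)                   # ML class-conditional mean per feature
        var = Xc.var(axis=0)                   # ML class-conditional variance (divides by N_c)
        prior = Xc.shape[0] / Xtr.shape[0]     # lambda_ML for the class label
        # Sum of per-feature Gaussian log-densities plus the class log-prior.
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
        log_post.append(np.log(prior) + ll)
    return np.argmax(np.stack(log_post, axis=1), axis=1)

q2_train_err = np.mean(gaussian_nb_predict(Xtrain_log, ytrain, Xtrain_log) != ytrain)
q2_test_err = np.mean(gaussian_nb_predict(Xtrain_log, ytrain, Xtest_log) != ytest)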

Q3. Logistic regression 
For the log-transformed data, fit a logistic regression model with l2 regularization (see week 4 lecture notes on “Newton’s Method for Logistic Regression” and “Exclude Bias from l2 Regularization”). For each regularization parameter value λ = {1, 2, ..., 9, 10, 15, 20, ..., 95, 100} (note the jump in interval from 10 to 15 and beyond), fit the logistic regression model on the training data and compute its training and test error rates (i.e., the percentage of emails classified wrongly). Make sure you include at least the following in your report:

•     Plots of training and test error rates versus λ

•     What do you observe about the training and test errors as λ changes?

•     Training and testing error rates for λ = 1, 10 and 100.

Don’t forget to include the bias term in the logistic regression, and note that your l2 regularization should not apply to the bias term.
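The following is a sketch of Newton's method for l2-regularized logistic regression with the bias excluded from the regularizer, again using the log-transformed arrays from above. The iteration cap and convergence tolerance are illustrative choices, not values prescribed by the assignment.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(Xtr, ytr, lam, n_iter=50, tol=1e-6):
    """Newton's method for l2-regularized logistic regression (bias unregularized)."""
    X = np.hstack([np.ones((Xtr.shape[0], 1)), Xtr])    # prepend a bias column of ones
    w = np.zeros(X.shape[1])
    reg = np.ones(X.shape[1])
    reg[0] = 0.0                                         # exclude the bias from regularization
    for _ in range(n_iter):
        mu = sigmoid(X @ w)
        g = X.T @ (mu - ytr) + lam * reg * w             # gradient of the regularized NLL
        H = (X * (mu * (1 - mu))[:, None]).T @ X + lam * np.diag(reg)   # Hessian
        step = np.linalg.solve(H, g)
        w -= step                                        # Newton update
        if np.linalg.norm(step) < tol:
            break
    return w

def logreg_error(w, Xd, y):
    X = np.hstack([np.ones((Xd.shape[0], 1)), Xd])
    return np.mean((sigmoid(X @ w) >= 0.5).astype(int) != y)

lams = np.concatenate([np.arange(1, 10), np.arange(10, 101, 5)])   # 1..9, then 10, 15, ..., 100
q3_train_err, q3_test_err = [], []
for lam in lams:
    w = fit_logreg(Xtrain_log, ytrain, lam)
    q3_train_err.append(logreg_error(w, Xtrain_log, ytrain))
    q3_test_err.append(logreg_error(w, Xtest_log, ytest))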

Q4. K-Nearest Neighbors 
For the log-transformed data, implement a KNN classifier (see week 5 lecture notes on “Nonparametric Classification”). Use the Euclidean distance to measure distance between neighbors.

For each value of K = {1, 2, ..., 9, 10, 15, 20, ..., 95, 100} (note the jump in interval from 10 to 15 and beyond), compute the training and test error rates (i.e., the percentage of emails classified wrongly). A short code sketch is given after the list below. Make sure you include at least the following in your report:

•     Plots of training and test error rates versus K

•     What do you observe about the training and test errors as K changes?

•     Training and testing error rates for K = 1, 10 and 100.
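A minimal KNN sketch with Euclidean distances on the log-transformed data. Ties at exactly half spam neighbors are broken towards the spam class here; the assignment does not fix a tie-breaking rule, so treat this as one reasonable convention. Note that for the training error at K = 1 each email is its own nearest neighbor.

import numpy as np

def knn_predict(Xtr, ytr, X, ks):
    """Return a dict mapping each K in ks to predicted labels for X."""
    # Pairwise squared Euclidean distances (monotone in the true distance).
    d2 = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Xtr ** 2, axis=1)[None, :]
          - 2.0 * X @ Xtr.T)
    order = np.argsort(d2, axis=1)                       # nearest training emails first
    preds = {}
    for k in ks:
        frac_spam = ytr[order[:, :k]].mean(axis=1)       # fraction of spam among K nearest
        preds[k] = (frac_spam >= 0.5).astype(int)        # ties go to spam
    return preds

ks = np.concatenate([np.arange(1, 10), np.arange(10, 101, 5)])   # 1..9, then 10, 15, ..., 100
train_preds = knn_predict(Xtrain_log, ytrain, Xtrain_log, ks)
test_preds = knn_predict(Xtrain_log, ytrain, Xtest_log, ks)
q4_train_err = [np.mean(train_preds[k] != ytrain) for k in ks]
q4_test_err = [np.mean(test_preds[k] != ytest) for k in ks]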
