CSE537 - Project 03 - Machine Learning

1.   Clickstream Mining with Decision Trees [50 points]
The project is based on a task posed in KDD Cup 2000. It involves mining click-stream data collected from Gazelle.com, which sells legwear products. Your task is to determine: given a set of page views, will the visitor view another page on the site or leave?

The data set given to you is in a different form than the original. In particular, the numerical values have been discretized by partitioning them into 5 intervals of equal frequency. This way we get a data set where each feature takes a finite set of values. These values are mapped to integers so that the data is easier for you to handle.

Dataset

The data set can be found on Blackboard.

You have 5 files in .csv format:

1.     trainfeat.csv: Contains 40000 examples, each with 274 features in the form of a 40000 x 274 matrix.

2.     trainlabs.csv: Contains the labels (class) for each training example (did the visitor view another page?)

3.     testfeat.csv: Contains 25000 examples, each with 274 features in the form of a 25000 x 274 matrix.

4.     testlabs.csv: Contains the labels (class) for each testing example.

5.     featnames.csv: Contains the "names" of the features. These might be useful if you need to know what the features represent.

The format of the files is not complicated: just rows of integers separated by spaces.
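For example, a minimal loading sketch (assuming NumPy; the delimiter may need adjusting to match the actual files):

    import numpy as np

    # Each file is rows of integers separated by spaces; if a file turns
    # out to be comma-delimited, pass delimiter="," to loadtxt.
    train_feats = np.loadtxt("trainfeat.csv", dtype=int)    # (40000, 274)
    train_labels = np.loadtxt("trainlabs.csv", dtype=int)   # (40000,)
    test_feats = np.loadtxt("testfeat.csv", dtype=int)      # (25000, 274)
    test_labels = np.loadtxt("testlabs.csv", dtype=int)     # (25000,)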

Stopping Criterion: Chi-squared criterion
What about split stopping? Suppose that the attribute we want to split on is irrelevant. Then we would expect the positive and negative examples (labels 1 and 0) to fall into each partition in the same proportions as in the current node. Suppose that splitting on attribute T produces partitions {T_i}, i = 1, ..., m.

Let p, n denote the number of positive and negative examples in our current dataset (not the whole set, just the remaining examples we work on at this node), and let N = p + n be the total number of examples in the current dataset. If the attribute is irrelevant to the class, the expected numbers of positives and negatives in each partition are

    p'_i = p * (N_i / N),     n'_i = n * (N_i / N)

where N_i is the number of examples falling into partition T_i. Then the statistic of interest is:

    S = sum_{i=1}^{m} [ (p_i - p'_i)^2 / p'_i + (n_i - n'_i)^2 / n'_i ]

where p_i, n_i are the positives and negatives in partition T_i. The main idea is that, if the attribute is irrelevant to the class, S is distributed according to a chi-squared distribution with m-1 degrees of freedom.

Now we must compute the p-value: the probability of observing a value at least as extreme as S under this distribution, i.e. P(X >= S). The test is passed, and the split is made, if the p-value is smaller than the given threshold. Remember, the smaller that probability is, the more unlikely it is that S comes from the chi-squared distribution, and consequently the more unlikely it is that T is irrelevant to the class.
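To make the procedure concrete, here is a minimal sketch of the test (assuming SciPy is available; counts_pos and counts_neg are hypothetical arrays holding p_i and n_i for each partition):

    import numpy as np
    from scipy.stats import chi2

    def chi_squared_p_value(counts_pos, counts_neg):
        """Return P(X >= S) for the split test described above.

        counts_pos[i], counts_neg[i] hold p_i and n_i for partition T_i.
        """
        counts_pos = np.asarray(counts_pos, dtype=float)
        counts_neg = np.asarray(counts_neg, dtype=float)
        p, n = counts_pos.sum(), counts_neg.sum()
        if p == 0 or n == 0:                 # pure node: nothing to test
            return 1.0
        N = p + n
        N_i = counts_pos + counts_neg        # examples per partition
        mask = N_i > 0                       # ignore empty partitions
        m = int(mask.sum())
        if m < 2:                            # no real split happened
            return 1.0
        exp_pos = p * N_i[mask] / N          # expected positives p'_i
        exp_neg = n * N_i[mask] / N          # expected negatives n'_i
        S = np.sum((counts_pos[mask] - exp_pos) ** 2 / exp_pos
                   + (counts_neg[mask] - exp_neg) ** 2 / exp_neg)
        return chi2.sf(S, df=m - 1)

You would then split on the attribute only when the returned p-value is below the chosen threshold (0.05, 0.01, or 1 in this project).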

Your Task:

Code
Implement the ID3 decision tree learner in Python. Your program should use the chi-squared split-stopping criterion, with the p-value threshold given as a parameter. Run your implementation with the threshold set to 0.05, 0.01, and 1. Remember, a threshold of 1 corresponds to growing the full tree.
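At each node, ID3 chooses the attribute with the highest information gain; a minimal sketch of that computation under the standard formulation (the array names are illustrative):

    import numpy as np

    def entropy(labels):
        """Shannon entropy of a 0/1 label array."""
        if labels.size == 0:
            return 0.0
        p = labels.mean()
        if p == 0.0 or p == 1.0:
            return 0.0
        return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    def information_gain(feature_column, labels):
        """Gain from splitting `labels` on the discrete values of one feature."""
        gain = entropy(labels)
        for v in np.unique(feature_column):
            subset = labels[feature_column == v]
            gain -= (subset.size / labels.size) * entropy(subset)
        return gain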



2.   Spam Filter [50 points]
The dataset we will be using is a subset of the 2005 TREC Public Spam Corpus. It contains a training set and a test set. Both files use the same format: each line represents the space-delimited properties of an email, with the first being the email ID, the second being whether it is spam or ham (non-spam), and the rest being words and their occurrence counts in the email. In preprocessing, non-word characters have been removed, and features were selected in a manner similar to what Mehran Sahami did in his original paper on classifying spam with Naive Bayes.
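A minimal parsing sketch for this line format (the file names and in-memory representation are assumptions; adjust to the actual files from Blackboard):

    def parse_line(line):
        """Split 'ID label word1 count1 word2 count2 ...' into its parts."""
        tokens = line.split()
        email_id, label = tokens[0], tokens[1]
        counts = {tokens[i]: int(tokens[i + 1])       # word -> occurrences
                  for i in range(2, len(tokens), 2)}
        return email_id, label, counts

    # Hypothetical file names; use whatever the Blackboard archive provides.
    with open("train") as f:
        train = [parse_line(line) for line in f if line.strip()]
    with open("test") as f:
        test = [parse_line(line) for line in f if line.strip()]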

Dataset

The data set can be found on Blackboard.

Your Task:

Code

Implement the Naive Bayes algorithm to classify spam.
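One common reading of this task is a multinomial Naive Bayes with add-alpha smoothing; a minimal sketch under that assumption, reusing the (id, label, word-count) tuples from the parsing sketch above:

    import math
    from collections import Counter, defaultdict

    def train_nb(examples):
        """examples: (id, label, word->count) tuples from parse_line."""
        class_counts = Counter()               # label -> number of emails
        word_counts = defaultdict(Counter)     # label -> word -> total count
        vocab = set()
        for _, label, counts in examples:
            class_counts[label] += 1
            for w, c in counts.items():
                word_counts[label][w] += c
                vocab.add(w)
        return class_counts, word_counts, vocab

    def classify(model, counts, alpha=1.0):
        """Most probable label for one email under add-alpha smoothing."""
        class_counts, word_counts, vocab = model
        total = sum(class_counts.values())
        best, best_score = None, -math.inf
        for label in class_counts:
            score = math.log(class_counts[label] / total)       # log prior
            denom = sum(word_counts[label].values()) + alpha * len(vocab)
            for w, c in counts.items():
                # Unseen words fall back to alpha / denom.
                score += c * math.log((word_counts[label][w] + alpha) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause.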

Report
Use your algorithm to learn from the training set and report accuracy on the test set. Try various smoothing parameters for the Naive Bayes learner.
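For instance, you could sweep a few smoothing values and compare test accuracy (a sketch, assuming the train/test lists and functions from the earlier sketches; since smoothing is applied at classification time, the model only needs to be trained once):

    model = train_nb(train)
    for alpha in (0.01, 0.1, 1.0, 10.0):
        correct = sum(classify(model, counts, alpha=alpha) == label
                      for _, label, counts in test)
        print(f"alpha={alpha}: accuracy={correct / len(test):.4f}")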

Which parameters work best?

Extra Credit (10 points)
Feature selection makes learning much easier, but it also throws out useful information. For example, the exclamation mark (!) often occurs in spam. Even the format of the email sender matters: when an email address appears in the address book, a typical email client will replace it with the contact name, which means the email is unlikely to be spam (unless, of course, you are a friend of the spammer!). Sahami's paper discusses a few such features he used in his classifier. For extra credit, you can play with the original files and come up with useful features that improve your classifier. Index.zip (on Blackboard) contains the list of the files used in train/test.
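As one hypothetical example of such a feature, you could count exclamation marks in a raw message file and append that count to the email's feature set (the path handling is an assumption; Index.zip defines the actual layout):

    def exclamation_count(path):
        """Count '!' characters in one raw email file."""
        with open(path, errors="ignore") as f:
            return f.read().count("!")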

