This assignment is on the classification of text documents. Using python3 is highly recommended, because libraries like nltk make many steps easier (for example, stop word removal and lemmatization). If you use another language, you will most probably have to implement these modules yourself, and they might not perform as well as the nltk library in python. You can also use packages from scipy for classification purposes.
Assumptions: Remove stop words and punctuation marks, convert everything to lowercase, and perform lemmatization to generate tokens from each document (use the nltk library in python). Assume positional and class-conditional independence of the terms in a document for Naive Bayes. For vector space classification, assume each document is represented by its normalized tf-idf vector. The tf-idf vector construction follows the same procedure as stated in assignment 2.
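As a rough illustration of the normalized tf-idf representation mentioned above, here is a minimal sketch in plain python3. The weighting used here (raw term frequency times log10(N/df)) and the L2 normalization are assumptions made for illustration only; the procedure from assignment 2 takes precedence wherever it differs. The function name tfidf_vectors is hypothetical, and the input is assumed to be documents that are already tokenized.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse dict (term -> weight) per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        # ASSUMPTION: weight = raw tf * log10(N / df); replace with the assignment 2 scheme.
        vec = {t: tf[t] * math.log10(n_docs / df[t]) for t in tf}
        # ASSUMPTION: "normalized" is read as division by the Euclidean (L2) length.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors

if __name__ == "__main__":
    toy_docs = [["cat", "dog"], ["cat", "fish", "fish"], ["bird"]]
    for vec in tfidf_vectors(toy_docs):
        print(vec)
```

Storing each vector as a dict keeps it sparse, which matters once the vocabulary grows to thousands of terms.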
Find your dataset here
1. Naive Bayes with feature selection
(a) Select the top x features using mutual information computed on the train data. Vary x in {1, 10, 100, 1000, 10000}.
(b) For each of the above values of x, train a multinomial Naive Bayes classifier on the given train data, with add-one smoothing.
(c) For each of the above values of x, train a Bernoulli Naive Bayes classifier on the given train data.
(d) Print the F1 score of each classifier on the test data for each value of x.
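The feature selection and multinomial training steps of question 1 could be sketched as follows. This is one possible reading, not a reference solution: mutual information is computed from document-level presence or absence of a term against one class versus the rest, and a term's score is taken as the maximum over classes (an assumption; with two classes the choice makes no difference). All function names are hypothetical, and the Bernoulli variant is not shown.

```python
import math
from collections import Counter

def mutual_information(doc_sets, labels, term, cls):
    """MI between 'term occurs in the document' and 'document belongs to cls'."""
    n = len(doc_sets)
    n11 = n10 = n01 = n00 = 0
    for terms, y in zip(doc_sets, labels):
        present, in_cls = term in terms, y == cls
        if present and in_cls:
            n11 += 1
        elif present:
            n10 += 1
        elif in_cls:
            n01 += 1
        else:
            n00 += 1
    # Each cell: (joint count, term marginal, class marginal).
    cells = [(n11, n11 + n10, n11 + n01), (n10, n11 + n10, n10 + n00),
             (n01, n01 + n00, n11 + n01), (n00, n01 + n00, n10 + n00)]
    # Empty cells contribute 0 (the limit of p * log p as p -> 0).
    return sum((j / n) * math.log2(n * j / (t * c)) for j, t, c in cells if j > 0)

def select_top_features(docs, labels, x):
    """Return the x terms with the highest MI score as a set."""
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    classes = set(labels)
    # ASSUMPTION: score of a term = max MI over classes.
    scores = {t: max(mutual_information(doc_sets, labels, t, c) for c in classes)
              for t in vocab}
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))  # ties broken alphabetically
    return set(ranked[:x])

def train_multinomial_nb(docs, labels, vocab):
    """Return (log priors, log conditional probabilities) with add-one smoothing."""
    prior, cond = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(t for d in class_docs for t in d if t in vocab)
        total = sum(counts.values())
        cond[c] = {t: math.log((counts[t] + 1) / (total + len(vocab))) for t in vocab}
    return prior, cond

def predict_multinomial_nb(model, tokens):
    """Pick the class with the highest log posterior; unselected terms are ignored."""
    prior, cond = model
    scores = {c: prior[c] + sum(cond[c][t] for t in tokens if t in cond[c])
              for c in prior}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    docs = [["cheap", "pills", "cheap"], ["cheap", "offer"],
            ["meeting", "agenda"], ["agenda", "notes", "meeting"]]
    labels = ["spam", "spam", "ham", "ham"]
    features = select_top_features(docs, labels, 3)
    model = train_multinomial_nb(docs, labels, features)
    print(sorted(features), predict_multinomial_nb(model, ["cheap", "offer"]))
```

This loop over the whole vocabulary is quadratic in spirit and only meant for small data; on the real dataset, precomputing per-term document counts per class would be the natural optimization.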
2. Vector space classification - Linear: Use the Rocchio classifier to classify documents in the test data and print the F1 score. Use the following decision rule: assign d to class c iff |µ(c) − v(d)| < |µ(c̄) − v(d)| − b, where µ(c) is the centroid of class c, c̄ is the complement class, and v(d) is the vector of document d. Vary b over the values {0, .01, .05, .1} and print the F1 score for each value of b.
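The Rocchio decision rule with the margin b can be sketched as below, assuming documents are stored as sparse dicts (term -> weight), that |·| denotes Euclidean distance, and that the task is binary (class c versus its complement). The helper names, including the small f1_score function, are hypothetical and only meant to show the shape of the computation.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}

def distance(u, v):
    """Euclidean distance between two sparse vectors (ASSUMPTION: this is what |.| means)."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))

def rocchio_predict(mu_c, mu_not_c, v, b):
    """True iff d is assigned to class c: |mu(c) - v(d)| < |mu(not c) - v(d)| - b."""
    return distance(mu_c, v) < distance(mu_not_c, v) - b

def f1_score(y_true, y_pred):
    """F1 for the positive class, from boolean gold labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0  # convention chosen here when precision or recall is undefined
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    mu_c = centroid([{"a": 1.0}, {"a": 0.8, "b": 0.6}])
    mu_not_c = centroid([{"c": 1.0}])
    test_vectors = [{"a": 1.0}, {"c": 1.0}]
    gold = [True, False]
    for b in (0, .01, .05, .1):
        pred = [rocchio_predict(mu_c, mu_not_c, v, b) for v in test_vectors]
        print(b, f1_score(gold, pred))
```

A larger b makes the classifier more reluctant to assign class c, so documents close to the boundary move to the complement class as b grows.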
3. Vector space classification - Non linear: Use the kNN classifier to classify documents in the test data and report the F1 score. Vary k in {1, 10, 50}. For the similarity score, use the inner product of the vector representations of the two documents. Print the F1 scores on the test data.
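A minimal kNN sketch with inner-product similarity is given below, assuming each document is a sparse dict (term -> weight) holding its normalized tf-idf vector. Majority voting among the k most similar train documents is assumed, and ties in the vote go to the class that appears first among the nearest neighbours; both choices, like the function names, are illustrative assumptions rather than requirements of the assignment.

```python
from collections import Counter

def inner_product(u, v):
    """Inner product of two sparse vectors; iterate over the shorter one."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def knn_predict(train_vectors, train_labels, v, k):
    """Label v by majority vote among the k train documents most similar to it."""
    sims = [(inner_product(x, v), i) for i, x in enumerate(train_vectors)]
    # Highest similarity first; equal similarities keep train order.
    nearest = sorted(sims, key=lambda p: (-p[0], p[1]))[:k]
    votes = Counter(train_labels[i] for _, i in nearest)
    # ASSUMPTION: on a tied vote, max() returns the class seen first among the neighbours.
    return max(votes, key=lambda c: votes[c])

if __name__ == "__main__":
    train_vectors = [{"a": 1.0}, {"a": 0.6, "b": 0.8}, {"c": 1.0}]
    train_labels = ["x", "x", "y"]
    for k in (1, 3):
        print(k, knn_predict(train_vectors, train_labels, {"c": 1.0}, k))
```

Because the vectors are length-normalized, the inner product coincides with cosine similarity, so a higher score means a closer neighbour.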