Naive Bayes (Multinomial) for sentiment analysis
In this assignment you will implement the Naive Bayes (Multinomial) classifier for Sentiment Analysis of an IMDB movie review dataset (a highly polarized dataset with 50000 movie reviews). The primary task is to classify the reviews into negative and positive.
More about the dataset: http://ai.stanford.edu/~amaas/data/sentiment/
For the Multinomial model, each document is represented by a vector of integer-valued variables, i.e., $x = (x_1, x_2, \ldots, x_{|V|})^T$, where each variable $x_i$ corresponds to the $i$-th word in a vocabulary $V$ and represents the number of times that word appears in the document. The probability of observing a document $x$ given its class label $y$ is defined as (for example, for $y = 1$):
$$p(x \mid y = 1) = \prod_{i=1}^{|V|} P(w_i \mid y = 1)^{x_i}$$
Here we assume that, given the class label $y$, each word in the document follows a multinomial distribution over $|V|$ outcomes, and $P(w_i \mid y = 1)$ is the probability that a randomly selected word is word $i$ for a document of the positive class. Note that $\sum_{i=1}^{|V|} P(w_i \mid y) = 1$ for $y = 0$ and $y = 1$. Your implementation needs to estimate $p(y)$ and $P(w_i \mid y)$ for $i = 1, \ldots, |V|$ and $y = 0, 1$. For $p(y)$, you can use the maximum likelihood estimate (MLE). For $P(w_i \mid y)$, you MUST use Laplace smoothing. Note that the probability of observing a document given its class label, $p(x \mid y)$, is a product of many probabilities and therefore becomes extremely small, which causes numerical underflow. To avoid this problem, your implementation should operate on the logs of the probabilities.
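To make the estimation concrete, here is a minimal sketch (not the required implementation) of fitting and applying the model in log space. It assumes the smoothed estimate $P(w_i \mid y) = (N_{i,y} + \alpha) / (N_y + \alpha |V|)$, where $N_{i,y}$ is the total count of word $i$ over the training documents of class $y$ and $N_y$ is the total word count of that class. The function names `fit_multinomial_nb` and `predict_multinomial_nb` and the dense NumPy inputs are illustrative assumptions.

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log p(y) and log P(w_i | y) from a count matrix.

    X: (n_docs, |V|) array of word counts; y: (n_docs,) array of 0/1 labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.array([0, 1])
    # MLE for the class prior: fraction of training documents in each class.
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Total count of each word within each class, shape (2, |V|).
    word_counts = np.vstack([X[y == c].sum(axis=0) for c in classes])
    # Laplace smoothing: add alpha to every word count before normalizing.
    smoothed = word_counts + alpha
    log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
    return log_prior, log_likelihood

def predict_multinomial_nb(X, log_prior, log_likelihood):
    """Return the class (0 or 1) with the larger log joint score per document."""
    X = np.asarray(X, dtype=float)
    # log p(y) + sum_i x_i * log P(w_i | y): a sum of logs, so no underflow.
    scores = X @ log_likelihood.T + log_prior
    return scores.argmax(axis=1)
```

Because the log is monotonic, comparing these summed log scores gives the same prediction as comparing the products $p(y)\,p(x \mid y)$ directly.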
1 Description of the dataset
The provided dataset comes in two parts:
• IMDB.csv: This contains a single column called Reviews, where each row contains one movie review. There are 50K rows in total. The first 30K rows should be used as your training set (to train your model). The next 10K rows should be used as the validation set (use this for parameter tuning). The last 10K rows should be used as the test set (predict the labels).
• IMDB labels.csv: This contains 40K labels. Please use the first 30K labels for the training data and the last 10K labels for the validation data. The labels for the test data are not provided; we will use them to evaluate your predicted labels.
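The row ranges above can be sketched as a positional split. This is only an illustration: the function name `split_reviews` is hypothetical, and the loading step in the trailing comment assumes pandas and a local copy of the file.

```python
def split_reviews(reviews, labels, n_train=30_000, n_val=10_000):
    """Split rows by position into train / validation / test parts.

    `labels` covers only the first n_train + n_val rows; the remaining
    reviews form the unlabeled test set.
    """
    if len(labels) != n_train + n_val:
        raise ValueError("expected one label per training and validation row")
    train = (reviews[:n_train], labels[:n_train])
    val = (reviews[n_train:n_train + n_val], labels[n_train:])
    test = reviews[n_train + n_val:]
    return train, val, test

# Hypothetical loading step (pandas and the file path are assumptions):
#   import pandas as pd
#   reviews = pd.read_csv("IMDB.csv")["Reviews"].tolist()
```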
2 Data cleaning and generating BOW representation
Data Cleaning. Pre-processing is needed to make the texts cleaner and easier to process. The Reviews column contains comments written by users about the movies. These are known as "dirty text" that requires further cleaning. Typical cleaning steps include a) removing HTML tags; b) removing special characters; c) converting text to lower case; d) replacing punctuation characters with spaces; and e) removing stopwords (e.g., articles and pronouns) from consideration. You do not need to implement these steps yourself; we provide starter code containing these functions for you to use.
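To illustrate what these steps do to a review, here is a small sketch. It is not the starter code, whose functions may behave differently. The name `clean_review` and the tiny stopword list are assumptions made for the example.

```python
import re

# Tiny illustrative stopword list (an assumption); a real list is much longer.
STOPWORDS = {"a", "an", "the", "i", "you", "he", "she", "it", "this", "is"}

def clean_review(text):
    """Apply the cleaning steps in order and return a space-joined string."""
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = text.lower()                          # convert to lower case
    # Replace punctuation and other special characters with spaces.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Drop stopwords; split() also collapses repeated whitespace.
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)
```

For example, `clean_review("This movie is <br />GREAT!!! I loved it.")` returns `"movie great loved"`.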
Generating BOW representation. To transform variable-length reviews into fixed-length vectors, we use the Bag Of Words technique. It uses a list of words called the "vocabulary", so that given an input text we can output a vector of word counts, one for each word in the vocabulary. You can use the CountVectorizer functionality from sklearn to go over the full 50K reviews to generate the vocabulary and create the feature vectors representing each review. Note that CountVectorizer has several tunable parameters that can directly impact the resulting feature representation. These include max_features, which specifies the maximum number of features (keeping the terms with the highest frequency), and max_df and min_df, which filter words out of the vocabulary if their document frequency is too high (> max_df) or too low (< min_df), respectively.
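A toy example may help show what CountVectorizer produces. The three-document corpus below is made up for illustration; only `max_features` is set, and `max_df` and `min_df` are left at their defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie good", "bad movie", "good plot"]  # toy corpus

# Keep only the 2 most frequent terms across the corpus.
vectorizer = CountVectorizer(max_features=2)
X = vectorizer.fit_transform(docs)   # sparse matrix, one row per document

# vocabulary_ maps each surviving term to its column index;
# columns are ordered alphabetically by term.
vocab = sorted(vectorizer.vocabulary_)
counts = X.toarray()                 # dense copy, fine for a toy corpus
```

On this corpus the surviving vocabulary is "good" and "movie", so each review becomes a length-2 count vector. For the full 50K reviews you would normally keep `X` sparse rather than calling `toarray()`.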
3 What you need to do
1. Apply the data cleaning and feature generation steps described above to generate the BOW representation for all 50K reviews. For this step, use the default values for max_df and min_df and set max_features = 2000.
2. Train a multinomial Naive Bayes classifier with Laplace smoothing ($\alpha = 1$) on the training set. This involves learning $P(y = 1)$, $P(y = 0)$, $P(w_i \mid y = 1)$ for $i = 1, \ldots, |V|$, and $P(w_i \mid y = 0)$ for $i = 1, \ldots, |V|$ from the training data (the first 30K reviews and their associated labels).
3. Apply the learned Naive Bayes model to the validation set (the next 10K reviews) and report the validation accuracy of your model. Apply the same model to the test data and output the predictions to a file, which should contain a single column of 10K labels (0 for negative, 1 for positive). Please name the file test-prediction1.csv.
4. Tuning the smoothing parameter α. Train the Naive Bayes classifier with different values of α from 0 to 2 (in increments of 0.2). For each α value, apply the resulting model to the validation data to make predictions and measure the prediction accuracy. Report the results by creating a plot with the value of α on the x-axis and the validation accuracy on the y-axis. Comment on how the validation accuracy changes as α changes and provide a short explanation for your observation. Identify the best α value based on your experiments, apply the corresponding model to the test data, and output the predictions to a file named test-prediction2.csv.
5. Tune your heart out. For the last part, you are required to tune the parameters of the CountVectorizer (max_features, max_df, and min_df). You can freely choose the range of values you wish[1] to test for these parameters and use the validation set to select the best model. Please describe your strategy for choosing the value ranges and report the best parameters (as measured by the prediction accuracy on the validation set) and the resulting model's validation accuracy. You are also required to apply the chosen best model to make predictions for the test data and output the predictions to a file named test-prediction3.csv.
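For the smoothing sweep in step 4, one possible structure is sketched below. It refits a compact multinomial Naive Bayes for each α and records the validation accuracy. The function names are hypothetical, and clamping α to a tiny positive value (so that α = 0 never produces log 0 for unseen words) is an assumption of this sketch, not a requirement of the assignment.

```python
import numpy as np

def nb_validation_accuracy(X_train, y_train, X_val, y_val, alpha):
    """Fit multinomial Naive Bayes with smoothing `alpha`; return validation accuracy."""
    # ASSUMPTION: clamp alpha away from 0 so log(0) never occurs for unseen words.
    alpha = max(alpha, 1e-10)
    X_train, X_val = np.asarray(X_train, float), np.asarray(X_val, float)
    y_train, y_val = np.asarray(y_train), np.asarray(y_val)
    log_prior, log_lik = [], []
    for c in (0, 1):
        rows = X_train[y_train == c]
        log_prior.append(np.log(len(rows) / len(X_train)))  # MLE prior
        counts = rows.sum(axis=0) + alpha                    # smoothed word counts
        log_lik.append(np.log(counts / counts.sum()))
    # Log joint score per validation document and class, shape (n_val, 2).
    scores = X_val @ np.array(log_lik).T + np.array(log_prior)
    return float((scores.argmax(axis=1) == y_val).mean())

def sweep_alpha(X_train, y_train, X_val, y_val, alphas=np.linspace(0.0, 2.0, 11)):
    """Return a list of (alpha, validation accuracy) pairs, one per alpha."""
    return [(float(a), nb_validation_accuracy(X_train, y_train, X_val, y_val, a))
            for a in alphas]
```

The returned pairs can be passed to any plotting library to produce the α versus accuracy plot. `np.linspace(0.0, 2.0, 11)` is used instead of repeated addition of 0.2 to avoid floating-point drift in the grid.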
[1] You are encouraged to try your best to tune these parameters. Higher validation accuracy and testing accuracy will be rewarded with possible bonus points.
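For the CountVectorizer tuning in step 5, a simple exhaustive search over a small grid is one possible strategy. The sketch below is deliberately generic: `select_best_params` and `example_grid` are hypothetical names, the listed values are illustrative assumptions rather than recommended ranges, and `evaluate` stands for a function you would write that builds the features with the given parameters, trains on the training set, and returns the validation accuracy.

```python
from itertools import product

def select_best_params(grid, evaluate):
    """Try every combination in `grid`; return (best_params, best_accuracy).

    grid: dict mapping parameter name -> list of candidate values.
    evaluate: callable taking a params dict and returning validation accuracy.
    """
    names = list(grid)
    best_params, best_acc = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        acc = evaluate(params)
        if acc > best_acc:   # on ties, keep the first combination seen
            best_params, best_acc = params, acc
    return best_params, best_acc

# Illustrative ranges only; these values are assumptions, not recommendations.
example_grid = {
    "max_features": [2000, 5000, 10000],
    "max_df": [0.5, 0.8, 1.0],
    "min_df": [1, 5, 10],
}
```

The number of combinations grows multiplicatively with each parameter (27 for the grid above), so a coarse grid followed by a finer search around the best point keeps the cost manageable.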