Project 4: Movie Review Sentiment Analysis
You are provided with a dataset of IMDB movie reviews, where each review is labelled as positive or negative. The goal is to build a binary classification model that predicts the sentiment of a movie review.
Source
Download the dataset, "Project4_data.tsv", from the Resources page. The dataset has 50,000 rows (i.e., reviews) and 3 columns. Column 1, "new_id", is the ID of each review (the same as its row number); Column 2, "sentiment", is the binary response; and Column 3 is the review text.
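As a rough illustration of the three-column layout described above, the sketch below parses tab-separated lines into records. It makes several assumptions that the project page does not state: that the file starts with a header row, that "sentiment" is coded as 0/1, and that the third column can be stored under the hypothetical key "review". The function name parse_reviews is also made up for this example.

```python
def parse_reviews(lines, has_header=True):
    """Parse tab-separated lines into dicts with new_id, sentiment, and review text.

    Assumptions (not confirmed by the project page): a header row is present,
    sentiment is coded 0/1, and the key name "review" is a placeholder.
    """
    reviews = []
    for n, line in enumerate(lines):
        if has_header and n == 0:
            continue  # skip the assumed header row
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        # Split on the first two tabs only, so tabs inside the review text survive.
        new_id, sentiment, text = line.split("\t", 2)
        reviews.append(
            {"new_id": int(new_id), "sentiment": int(sentiment), "review": text}
        )
    return reviews


# Toy input standing in for the real file (these two reviews are invented).
sample_lines = [
    "new_id\tsentiment\treview",
    '1\t1\tA "great" film.',
    "2\t0\tDull\tand long.",
]
parsed = parse_reviews(sample_lines)
```

With the real file, the same function could be fed an open file handle, for example `with open("Project4_data.tsv", encoding="utf-8") as f: reviews = parse_reviews(f)`, provided the assumptions above hold.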
Download "Project4_splits.csv" from the Resources page. This file has 3 columns, and each column contains the "new_id" (equivalently, the row number) of 25,000 test samples. That is, using "Project4_splits.csv", we can split the 50,000 reviews into three sets of training and test data: for each column, the listed reviews form the test set and the remaining 25,000 reviews form the training set.
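The splitting step can be sketched as follows. This is only a minimal illustration, assuming the splits file has a header row (the header names used in the toy data are hypothetical) and that each review is a record with a "new_id" field; the helper names read_split_columns and split_by_test_ids are invented for this example.

```python
import csv


def read_split_columns(lines, has_header=True):
    """Return one set of test IDs per column of the splits file.

    `lines` can be an open file or any iterable of CSV text lines.
    The header row is an assumption; pass has_header=False if there is none.
    """
    rows = [row for row in csv.reader(lines) if row]  # drop blank lines
    if has_header:
        rows = rows[1:]
    n_cols = len(rows[0])
    return [{int(row[j]) for row in rows} for j in range(n_cols)]


def split_by_test_ids(reviews, test_ids, id_key="new_id"):
    """Partition reviews into (train, test) according to a set of test IDs."""
    train, test = [], []
    for review in reviews:
        if int(review[id_key]) in test_ids:
            test.append(review)
        else:
            train.append(review)
    return train, test


# Toy stand-ins for the real files: 6 reviews, 3 splits of 2 test IDs each.
toy_reviews = [{"new_id": i, "sentiment": i % 2} for i in range(1, 7)]
toy_split_lines = ["split_1,split_2,split_3", "1,2,3", "4,5,6"]

splits = read_split_columns(toy_split_lines)
train, test = split_by_test_ids(toy_reviews, splits[0])
```

On the real data, each of the three ID sets would contain 25,000 entries, so each call to split_by_test_ids would yield 25,000 training and 25,000 test reviews.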
Performance Target. The evaluation metric is AUC on the test data. Full credit is given to submissions whose minimum AUC over the three test sets is at least 0.96.
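To make the target concrete, here is a small sketch of how AUC and the minimum-over-three-splits check might be computed. It uses the standard rank-based (Mann-Whitney) formulation of AUC in plain Python and assumes labels are coded 0/1; it is not the course's grading script, and the function names and the per-split values in the demo are hypothetical.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive is scored above a random
    negative, with ties counted as 1/2 (rank-based Mann-Whitney form).

    Assumes labels are coded as 0 (negative) and 1 (positive).
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    pairs = sorted(zip(scores, labels))  # ascending by score
    rank_sum = 0.0  # sum of (tie-averaged) ranks of the positive examples
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1  # pairs[i:j] share the same score
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def meets_target(split_aucs, threshold=0.96):
    """Full credit requires the worst of the per-split AUCs to reach the threshold."""
    return min(split_aucs) >= threshold


# Tiny check: one of the four positive/negative pairs is ordered wrongly.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Made-up per-split AUCs, purely to show how the minimum rule behaves.
passes = meets_target([0.961, 0.963, 0.958])
```

Because the rule uses the minimum, a single split below 0.96 is enough to miss the target, even if the other two splits exceed it.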
A subset of this dataset was used in a Kaggle competition: Bag of Words Meets Bags of Popcorn. You can check how others have analyzed the data and try some of the sample code on Kaggle.