COMP551 - MiniProject 2


Background
In this mini-project you will develop models to analyze text from the website Reddit (https://www.reddit.com/), a popular social media forum where users post and comment on content in different themed communities, or subreddits. The goal of this project is to develop a supervised classification model that can predict what community a comment came from. You will be competing with other groups to achieve the best accuracy in a competition for this prediction task. However, your performance on the competition is only one aspect of your grade. We also ask that you implement a minimum set of models and report on their performance in a write-up.

The Kaggle website has a link to the data, which is a 20-class classification problem with a (nearly) balanced dataset (i.e., there are equal numbers of comments from 20 different subreddits). The data is provided in CSVs, where the text content of the comment is enclosed in quotes. Each entry in the training CSV contains a comment ID, the text of the comment, and the name of the target subreddit for that comment. For the test CSV, each line contains a comment ID and the text for that comment. You can view and download the data via this link: https://www.kaggle.com/c/reddit-comment-classification-comp-551/data

You need to submit a prediction for each comment in the test CSV; i.e., you should make a prediction CSV where each line contains a comment ID and the predicted subreddit for that comment. Since the data is balanced and involves multiple classes, you will be evaluated according to the accuracy score of your model. An example of the proper formatting for the submission file can be viewed at: https://www.kaggle.com/c/reddit-comment-classification-comp-551/overview/evaluation.
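Writing the submission file is a one-liner with Python's standard csv module. The sketch below assumes the header names `Id` and `Category` and uses made-up predictions; check the sample submission file on Kaggle for the exact column headers it expects.

```python
import csv

# Hypothetical predictions: (comment ID, predicted subreddit) pairs.
predictions = [(0, "hockey"), (1, "nba"), (2, "leagueoflegends")]

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Category"])  # assumed header names; verify against the Kaggle sample
    writer.writerows(predictions)
```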

Tasks
You are welcome to try any model you like on this task, and you are free to use any libraries you like to extract features. However, you must meet the following requirements:

•    You must implement a Bernoulli Naive Bayes model (i.e., the Naive Bayes model from Lecture 5) from scratch (i.e., without using any external libraries such as SciKit learn). You are free to use any text preprocessing that you like with this model. Hint 1: you may want to use Laplace smoothing with your Bernoulli Naive Bayes model. Hint 2: you can choose the vocabulary for your model (i.e., which words you include vs. ignore), but you should provide justification for the vocabulary you use.

•    You must run experiments using at least two different classifiers from the SciKit learn package (which are not Bernoulli Naive Bayes). Possible options are:

–    Logistic regression

(https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

–    Decision trees

(https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

–    Support vector machines [to be introduced in Lecture 10 on Oct. 7th]

(https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)

•    You must develop a model validation pipeline (e.g., using k-fold cross validation or a held-out validation set) and report on the performance of the above-mentioned model variants.

•    You should evaluate all the model variants above (i.e., Naive Bayes and the SciKit learn models) using your validation pipeline (i.e., without submitting to Kaggle) and report on these comparisons in your write-up. Ideally, you should only run your “best” model on the Kaggle competition, since you are limited to two submissions to Kaggle per day.
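To make the first requirement concrete, here is a minimal from-scratch sketch of Bernoulli Naive Bayes with Laplace smoothing. It assumes you have already converted your comments into a binary document-term matrix (1 if a vocabulary word occurs in the comment, 0 otherwise); the class name and attribute names are our own, not from the assignment.

```python
import numpy as np

class BernoulliNaiveBayes:
    """Bernoulli Naive Bayes with Laplace smoothing, implemented from scratch.

    X is a binary document-term matrix (n_docs x vocab_size); y holds class labels.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        n_docs, n_feats = X.shape
        self.log_prior_ = np.zeros(len(self.classes_))
        self.feat_log_prob_ = np.zeros((len(self.classes_), n_feats))
        self.feat_log_neg_prob_ = np.zeros((len(self.classes_), n_feats))
        for i, c in enumerate(self.classes_):
            Xc = X[y == c]
            self.log_prior_[i] = np.log(len(Xc) / n_docs)
            # P(word present | class), smoothed so no probability is ever 0 or 1
            theta = (Xc.sum(axis=0) + self.alpha) / (len(Xc) + 2 * self.alpha)
            self.feat_log_prob_[i] = np.log(theta)
            self.feat_log_neg_prob_[i] = np.log(1 - theta)
        return self

    def predict(self, X):
        # log P(c) + sum_j [ x_j * log(theta_jc) + (1 - x_j) * log(1 - theta_jc) ]
        scores = (self.log_prior_
                  + X @ self.feat_log_prob_.T
                  + (1 - X) @ self.feat_log_neg_prob_.T)
        return self.classes_[np.argmax(scores, axis=1)]
```

Because the likelihood term includes both the present-word and absent-word factors, the score of a comment depends on your whole vocabulary, which is one reason the vocabulary choice (Hint 2) matters.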
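For the SciKit learn classifiers and the validation pipeline, a compact pattern is to chain a text vectorizer with a classifier and score it with k-fold cross validation. The sketch below uses TfidfVectorizer and LogisticRegression on a tiny made-up corpus (the texts and labels are placeholders, not the Kaggle data); swapping in LinearSVC or DecisionTreeClassifier requires changing only one line.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder corpus standing in for Reddit comments and their subreddits.
texts = ["stick and puck", "power play goal", "slap shot save",
         "gpu shader bug", "render pipeline crash", "vertex buffer overflow"]
labels = ["hockey", "hockey", "hockey", "graphics", "graphics", "graphics"]

# Pipeline: text -> tf-idf features -> classifier; cross_val_score refits it per fold.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, texts, labels, cv=3, scoring="accuracy")
print(scores.mean())
```

Running every model variant (including your from-scratch Naive Bayes) through the same folds gives comparable validation accuracies, so you can pick a single best model before spending one of your two daily Kaggle submissions.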