Background
In this miniproject you will implement naive Bayes and K-fold cross-validation from scratch, use logistic regression [1] from the scikit-learn package (or optionally implement it from scratch), and compare the two algorithms on two distinct textual datasets. The goal is to gain experience implementing these algorithms from scratch and to get hands-on experience comparing the performance of different models.
Task 1: Acquire and preprocess the data
Your first task is to acquire the data and clean it (if necessary). To turn the text data into numerical features, use the bag-of-words representation via the scikit-learn class CountVectorizer; see [2] for more details on the function. Please use these online resources as guidelines and avoid directly copying and pasting from them into your project; make sure you understand each line and try to write the code yourself when following these guides.
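For illustration, here is a minimal sketch of the bag-of-words step; the two toy documents are placeholders for your real data:

```python
# A minimal sketch: turn raw documents into a bag-of-words count matrix.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the movie was great",
    "the movie was terrible",
]

vectorizer = CountVectorizer()         # default tokenization and vocabulary
X = vectorizer.fit_transform(docs)     # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # learned vocabulary (scikit-learn >= 1.0)
print(X.toarray())                         # word counts per document
```

Remember to fit the vectorizer on the training data only, and reuse it (via transform) on the test data.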
We will use two datasets in this project, outlined below.
• 20 newsgroups dataset. Use the default train subset (subset='train' and remove=('headers', 'footers', 'quotes') in sklearn.datasets) to train the models, and report the final performance on the test subset. Note: you need to start from the raw text data and convert the text to feature vectors. Please refer to https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html for a tutorial on the steps needed and for further tips on processing text data. (A loading sketch is given after this list.)
• IMDB Reviews: http://ai.stanford.edu/~amaas/data/sentiment/. Here, use only the reviews in the train folder for training, and report the performance on the test folder. You need to work with the raw text documents to build your own features; ignore the pre-formatted feature files.
You are free to use any Python libraries you like to extract features and preprocess the data.
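As a starting point, here is a hedged sketch of loading both datasets; the 'aclImdb/train' path is an assumption standing in for wherever you unpack the downloaded archive:

```python
# A minimal loading sketch for the two datasets (paths are assumptions).
from pathlib import Path
from sklearn.datasets import fetch_20newsgroups

# 20 newsgroups: scikit-learn downloads the raw text for you.
train = fetch_20newsgroups(subset='train',
                           remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test',
                          remove=('headers', 'footers', 'quotes'))
print(len(train.data), "training documents")

# IMDB: read the raw review files yourself from the train folder.
def load_imdb_split(root):
    texts, labels = [], []
    for label, name in enumerate(['neg', 'pos']):   # 0 = negative, 1 = positive
        for path in sorted(Path(root, name).glob('*.txt')):
            texts.append(path.read_text(encoding='utf-8'))
            labels.append(label)
    return texts, labels

imdb_texts, imdb_labels = load_imdb_split('aclImdb/train')  # placeholder path
```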
Task 2: Implement Naive Bayes and k-fold cross validation
You are free to implement these models as you see fit, but you should follow the equations that are presented in the lecture slides, and you must implement the models from scratch (i.e., you cannot use SciKit Learn or any other pre-existing implementations of these methods).
In particular, your two main tasks in this part are to:
1. Implement naive Bayes, using the appropriate type of likelihood for features.
2. Implement k-fold cross-validation.
You must use Python for your implementation; using the numpy package, however, is allowed and encouraged. Regarding the implementation, we recommend the following approach:
• Implement the naive Bayes model as a Python class. Use the constructor to initialize the model parameters as attributes, and to define other important properties of the model. (A minimal sketch follows this list.)
• Your model class should have (at least) two functions:
– Define a fit function, which takes the training data (i.e., X and y) as input, along with any hyperparameters (e.g., the smoothing parameter for naive Bayes, or the learning rate and number of gradient descent iterations if you implement logistic regression from scratch). This function should train your model by modifying the model parameters.
– Define a predict function, which takes a set of input points (i.e., X) as input and outputs predictions (i.e., ŷ) for these points.
• In addition to the model class, you should also define a function evaluate_acc to evaluate the model accuracy. This function should take the true labels (i.e., y) and the predicted labels (i.e., ŷ) as input, and output the accuracy score.
• Lastly, you should implement a script to run k-fold cross-validation for hyper-parameter tuning and model selection.
Your implementation should have (at least) the three parts below:
– Define a cross_validation_split function, which takes the training data as input and splits it into k folds. Each fold is then used once as validation while the k − 1 remaining folds form the training set. (You may shuffle the data before splitting.)
– Define a kfoldCV function, which takes the train/validation splits generated above and a given model as input. The function should iterate through each train/validation split and return the average result across the k folds.
– Apply the functions above to the given model under different hyperparameter settings, and select the hyperparameter values that achieve the best average k-fold result. (A sketch of these utilities follows this list.)
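To make the recommendations above concrete, here is a minimal sketch of a multinomial naive Bayes class with Laplace smoothing, together with evaluate_acc. The class name, the alpha parameter, and the use of log-probabilities are illustrative choices, not requirements, and X is assumed to be a dense count matrix (call .toarray() on sparse CountVectorizer output):

```python
import numpy as np

class NaiveBayes:
    """Multinomial naive Bayes on bag-of-words counts (a sketch, not a spec)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha           # Laplace smoothing hyperparameter

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes = np.unique(y)
        n_samples, n_features = X.shape
        self.log_priors = np.zeros(len(self.classes))
        self.log_likelihoods = np.zeros((len(self.classes), n_features))
        for i, c in enumerate(self.classes):
            Xc = X[y == c]
            self.log_priors[i] = np.log(Xc.shape[0] / n_samples)
            counts = Xc.sum(axis=0) + self.alpha      # smoothed word counts
            self.log_likelihoods[i] = np.log(counts / counts.sum())
        return self

    def predict(self, X):
        # log P(y=c) + sum_j x_j * log P(word_j | y=c), for every class c
        scores = X @ self.log_likelihoods.T + self.log_priors
        return self.classes[np.argmax(scores, axis=1)]

def evaluate_acc(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))
```

And a matching sketch of the cross-validation utilities; this kfoldCV variant generates its folds internally for brevity, and the shuffling seed and candidate alpha values are arbitrary:

```python
def cross_validation_split(X, y, k=5, seed=0):
    """Shuffle, then yield (train_idx, val_idx) index pairs, one per fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def kfoldCV(model, X, y, k=5):
    """Average validation accuracy of `model` over the k folds."""
    scores = []
    for train_idx, val_idx in cross_validation_split(X, y, k):
        model.fit(X[train_idx], y[train_idx])
        scores.append(evaluate_acc(y[val_idx], model.predict(X[val_idx])))
    return np.mean(scores)

# Hyperparameter selection: keep whichever smoothing value scores best on
# average (X_train and y_train are assumed to be NumPy arrays from Task 1).
best_alpha = max([0.01, 0.1, 1.0, 10.0],
                 key=lambda a: kfoldCV(NaiveBayes(alpha=a), X_train, y_train))
```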
You are free to use any Python libraries you like to tune the hyperparameters; for example, see https://scikit-learn.org/stable/auto_examples/model_selection/plot_randomized_search.html#sphx-glr-auto-examples-model-selection-plot-randomized-search-py.
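For that library route, a minimal sketch using scikit-learn's RandomizedSearchCV with logistic regression (the search range for the regularization strength C is an illustrative choice):

```python
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Sample 10 regularization strengths C and keep the one with the best
# 5-fold cross-validated accuracy (X_train, y_train assumed from Task 1).
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={'C': loguniform(1e-3, 1e3)},
    n_iter=10, cv=5, scoring='accuracy', random_state=0)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```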
Task 3: Run experiments
The goal of this project is to have you explore linear classification and compare different features and models. Use 5-fold cross validation to estimate performance in all of the experiments. Evaluate the performance using accuracy. You are welcome to perform any experiments and analyses you see fit (e.g., to compare different features), but at a minimum you must complete the following experiments in the order stated below:
1. We expect you to conduct multiclass classification on the 20 newsgroups dataset, and binary classification on the IMDB Reviews dataset.
2. In a single table, compare and report the performance of naive Bayes and logistic regression on each of the two datasets (with their best hyperparameters), and highlight the winner for each dataset and overall. The full set of hyperparameters for logistic regression is documented in [3].
3. Further, with a plot, compare the accuracy of the two models as a function of the dataset size (by controlling the training size). For example, you can randomly select 20%, 40%, 60%, and 80% of the available training data and train your model on these subsets. Then compare the performance of the corresponding models and highlight the best [4]. (A sketch of this experiment is given below.)
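One possible shape for this experiment, assuming X_train, y_train, X_test, and y_test are NumPy arrays from Task 1 and reusing the illustrative NaiveBayes, evaluate_acc, and best_alpha names from the Task 2 sketches:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fractions = [0.2, 0.4, 0.6, 0.8, 1.0]
accuracies = []
for frac in fractions:
    # Randomly select a subset of the training data of the given size.
    n = int(frac * len(y_train))
    idx = rng.choice(len(y_train), size=n, replace=False)
    model = NaiveBayes(alpha=best_alpha).fit(X_train[idx], y_train[idx])
    accuracies.append(evaluate_acc(y_test, model.predict(X_test)))

plt.plot([100 * f for f in fractions], accuracies, marker='o', label='naive Bayes')
plt.xlabel('training set size (%)')
plt.ylabel('test accuracy')
plt.legend()
plt.show()
```

The same loop with a second curve for logistic regression gives the comparison plot.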
Note: The above experiments are the minimum requirements that you must complete; however, this project is open-ended. For this part, you might implement logistic regression from scratch (trying different learning rates and investigating different stopping criteria for the gradient descent), try linear regression for predicting ratings in the IMDB data, or try different text embedding methods as alternatives to bag-of-words. You are also welcome and encouraged to try any other model covered in the class, and you are free to implement it yourself or use any Python library that provides an implementation, e.g., the scikit-learn package. Of course, you do not need to do all of these things; look at them as suggestions, and try to demonstrate curiosity, creativity, rigour, and an understanding of the course material in how you run your chosen experiments and how you report on them in your write-up.
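If you do take the logistic-regression-from-scratch route, a minimal binary gradient-descent sketch could look like the following; the learning rate and iteration count are arbitrary starting points, and a fixed iteration budget stands in for a real stopping criterion:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the average logistic (cross-entropy) loss.

    X: (n, d) feature matrix; y: (n,) labels in {0, 1}.
    """
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)            # predicted P(y = 1 | x)
        grad_w = X.T @ (p - y) / len(y)   # gradient of the loss w.r.t. w
        grad_b = np.mean(p - y)           # gradient w.r.t. the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_logistic(X, w, b):
    return (sigmoid(X @ w + b) >= 0.5).astype(int)
```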