COMP10200 Assignment 3 Part 2

Overview
For this assignment, you will obtain some machine learning data for classification, test various SKLearn implementations of Naïve Bayes, and report your findings.
This is Part 2, in which you will obtain a corpus for text classification and apply the Multinomial Naïve Bayes classification algorithm to it using various representations of the text.
The Data
You should use the Reuters-21578 corpus for this. There is a zip file on Canvas with a cleaned-up version of the corpus, some helper code for reading it, and a handout explaining it. Feel free to use this code.
Choose at least 5 binary classification tasks from the Reuters set (i.e. pick five labels and, for each label, create a binary classification task in which every text is labeled 1 if it carries that label and 0 if it does not).
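For concreteness, here is a minimal sketch of turning five chosen labels into five binary target vectors. The names `docs` and `doc_labels` are illustrative assumptions standing in for whatever the Canvas helper code returns, and the five label names are just example picks from Reuters-21578.

```python
# Sketch only: docs/doc_labels stand in for the output of the Canvas
# helper code; the five label names are example picks from Reuters-21578.
import numpy as np

chosen_labels = ["earn", "acq", "crude", "grain", "trade"]

docs = ["oil prices rose sharply", "quarterly earnings beat estimates",
        "wheat harvest exceeds forecast"]
doc_labels = [{"crude"}, {"earn"}, {"grain"}]  # label sets, one per text

# One binary target vector per chosen label: 1 = has the label, 0 = does not.
binary_targets = {
    label: np.array([1 if label in labels else 0 for labels in doc_labels])
    for label in chosen_labels
}
print(binary_targets["crude"])  # [1 0 0]
```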
The Code
The code you use for this assignment should be written in Python using Numpy and SKLearn. It is expected that you will have to write some of this code yourself, but you are not expected to write everything from scratch. Feel free to adapt the code from Canvas or other sources to suit your needs. You could also explore other Python packages for natural language processing, such as nltk (Natural Language Toolkit). Use the options in the sklearn CountVectorizer for frequent, infrequent, and stop word removal.
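As a brief illustration, the CountVectorizer options referred to above might be used like this (the cutoff values are illustrative, not required settings):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["oil prices rose sharply", "oil exports fell", "earnings rose again"]

vectorizer = CountVectorizer(
    stop_words="english",  # built-in English stop word removal
    max_df=0.9,            # drop words in more than 90% of documents (frequent)
    min_df=2,              # drop words in fewer than 2 documents (infrequent)
)
X = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
```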
The most important requirement for the code you hand in is that it be correctly sourced and documented. You must make it clear where you got the original code from and what modifications, if any, you made to adapt it to your needs.
The Task
Your task is to test the Naïve Bayes classification algorithm against at least 8 different representations of the text. Create these 8 versions by choosing 2 values for each of 3 parameters (words vs. stems, single words vs. n-grams, bag vs. tf-idf, Multinomial vs. Complement NB, word removal vs. no word removal, etc.) and then trying all 8 combinations. At least one of the 3 parameters you vary should require some manipulation of the list of words (e.g. stemming, word removal, 2-grams, etc.).
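One way to generate the 8 versions systematically is to enumerate every combination of your three parameters. The three parameters shown below (stemming, n-grams, tf-idf) are assumptions; substitute whichever three you actually vary.

```python
# Sketch of enumerating the 2 x 2 x 2 = 8 combinations.
from itertools import product

stem_options = [False, True]        # raw words vs. stems
ngram_options = [(1, 1), (1, 2)]    # single words vs. words + 2-grams
tfidf_options = [False, True]       # raw counts vs. tf-idf weighting

for stem, ngrams, tfidf in product(stem_options, ngram_options, tfidf_options):
    print(f"version: stem={stem}, ngram_range={ngrams}, tfidf={tfidf}")
    # ... build the corresponding vectorizer and run the experiment here
```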
Create a standard training/testing split and use the same split to test each of the 8 versions. You do not have to do multiple runs of each version this time. For each version, report a combined confusion matrix that summarizes all 5 tasks together, then compute the micro-averaged accuracy, precision, and recall (micro-averaged just means the metrics are computed from the combined confusion matrix); a sketch of this evaluation appears after the list below. Here are some ideas for different text representations:
- Bag of words (using Multinomial NB and/or Complement NB)
- Bag of Stems
- Bag of words with some removed (stop words, frequent words, infrequent words, etc.)
- Bag of words and N-Grams (e.g. add 2-word phrases to the feature set)
- The tf-idf representation (using Multinomial NB and/or Gaussian NB)
  - SKLearn has a utility for computing this representation.
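Putting the pieces together, the sketch below evaluates one version under the assumptions of the earlier sketches (`docs`, `binary_targets`; use the full Reuters corpus in practice): the same split is reused for every task and every version, the 5 per-task confusion matrices are summed, and the micro-averaged metrics come from that combined matrix.

```python
# Sketch of evaluating one version; docs and binary_targets are assumed
# from the data sketch above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

# Fix the split once and reuse it for all 5 tasks and all 8 versions.
train_idx, test_idx = train_test_split(
    np.arange(len(docs)), test_size=0.3, random_state=42)

vectorizer = CountVectorizer()  # swap in the vectorizer for each version
X_train = vectorizer.fit_transform([docs[i] for i in train_idx])
X_test = vectorizer.transform([docs[i] for i in test_idx])

combined = np.zeros((2, 2), dtype=int)  # summed over the 5 binary tasks
for label, y in binary_targets.items():
    model = MultinomialNB().fit(X_train, y[train_idx])
    pred = model.predict(X_test)
    combined += confusion_matrix(y[test_idx], pred, labels=[0, 1])

tn, fp, fn, tp = combined.ravel()
accuracy = (tp + tn) / combined.sum()
precision = tp / (tp + fp)  # micro-averaged: computed from the summed matrix
recall = tp / (tp + fn)
print(combined, accuracy, precision, recall)
```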
Whatever text representations you choose, make sure that you use Multinomial NB for word counts and Bernoulli NB for binary data.
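For example (a sketch of the pairing, with toy data): raw counts feed Multinomial NB, while a binarized presence/absence representation feeds Bernoulli NB.

```python
# Sketch pairing each NB variant with the matching feature type.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

texts = ["oil prices rose", "oil exports fell", "earnings rose again"]
y = [1, 1, 0]

counts = CountVectorizer().fit_transform(texts)
MultinomialNB().fit(counts, y)          # word counts -> Multinomial NB

binary = CountVectorizer(binary=True).fit_transform(texts)
BernoulliNB().fit(binary, y)            # word presence/absence -> Bernoulli NB
```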
The Report
You should write a short report, using word processing software, that contains the following sections:
1. Data Description (data set name, source, description of classification task, and a description of how you generated the training and test set split, plus some statistics – number of features on each run, number of items in each class, number of items in training and test set, etc.)
2. Description of the Text Representations (describe the 8 different versions you experimented with, and explain how you created the different text representations)
3. Results (micro-averaged confusion matrix, accuracy, precision, and recall for each of the 8 versions)
4. Discussion (are there clear winners or losers? Give some solid ideas for why some text representations might be better or not better than others, referencing the properties of your data. Make a recommendation of the best text representation / configuration to use for this data set.)
5. Future Work (If you had more time, where would you go next? What other variations of text representation would you like to explore? What other algorithms or data sets would you like to use? What other tests would you like to do? Etc.)
Throughout your report, make sure you are using standard grammar and spelling, and make sure you make proper use of correct machine learning terminology wherever appropriate. If you quote or reference any information about Naïve Bayes or issues in Text Classification that were not explicitly covered in class, you should cite the source for that information using correct APA format.
Report Option: A Video Presentation
If you would prefer, you can record a video presentation of your results and submit that instead, along with a handout or slides showing the results. Your video should include all the information that would be in the report (see items 1 through 5 above). The handout or slides should show all the information from part 3 above (you can use that part of the report template) and any references, in APA format, for quotes or information that was not covered in class. When you get to part 3 of the report, you can simply refer the viewer to the handout or slides; you don't have to read them out.
The content of your presentation will be judged using the same rubric as the report would be, just with the phrase "well written" replaced by "well presented". Make sure you're using correct machine learning terminology wherever appropriate.
Handing In
Evaluation
This assignment will be evaluated based on: 1. the quality of the report you produce; 2. how well you met the requirements of the assignment; and 3. the quality of the code you handed in (including quality of documentation and referencing within the code).
See the rubric in the drop box for more information.