COMP 472 Mini-Project 1
Experiments with Machine Learning
For this mini-project, you will experiment with different machine learning algorithms and different data sets. Because you will use built-in functions from the scikit-learn library, the focus of this mini-project is on experimentation and analysis rather than on implementation.
1 Your Tasks
You will perform 2 tasks: a text classification task to better understand the Multinomial Naive Bayes classifier, and a second classification task with features of various types to better appreciate how to work with other kinds of data and machine learning models. You must use:
1. Python 3.8 and the scikit-learn library. Scikit-learn (see http://scikit-learn.org/stable/) provides an interface to program with a variety of different algorithms and built-in datasets. There are plenty of tutorials and examples of code online.
2. GitHub (make sure your project is private while developing).
2 Task 1: Text Classification
1. Download the BBC dataset provided on Moodle. The dataset, created by [Greene and Cunningham, 2006], is a collection of 2225 documents from the BBC news website already categorized into 5 classes: business, entertainment, politics, sport, and tech.
2. Plot the distribution of the instances in each class and save the graphic in a file called bbc-distribution.pdf. You may want to use matplotlib.pyplot and savefig to do this. This pre-analysis of the data set will allow you to determine if the classes are balanced, and which metric is more appropriate to use to evaluate the performance of your classifier.
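The plot in step 2 could be produced along the following lines. This is only a minimal sketch: the helper name plot_class_distribution and the toy label list are hypothetical, and the counts shown are not the real BBC counts. The file is written to a temporary directory so that the sketch does not overwrite real output.

```python
from collections import Counter
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt


def plot_class_distribution(labels, out_path):
    """Save a bar chart of the number of instances per class; return the counts."""
    counts = Counter(labels)
    classes = sorted(counts)
    fig, ax = plt.subplots()
    ax.bar(classes, [counts[c] for c in classes])
    ax.set_xlabel("Class")
    ax.set_ylabel("Number of instances")
    ax.set_title("Class distribution")
    fig.savefig(out_path)  # the file extension selects the output format (PDF here)
    plt.close(fig)
    return counts


# Hypothetical labels, for illustration only (not the real BBC counts).
toy_labels = ["business"] * 5 + ["sport"] * 4 + ["tech"] * 3
out_file = os.path.join(tempfile.mkdtemp(), "bbc-distribution.pdf")
toy_counts = plot_class_distribution(toy_labels, out_file)
```

With the real corpus, the labels would be the category name of each document rather than the toy list.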
3. Load the corpus using load_files and make sure you set the encoding to latin1. This will read the file structure and assign the category name to each file from its parent directory name.
4. Pre-process the dataset to have the features ready to be used by a multinomial Naive Bayes classifier. This means that the frequency of each word in each class must be computed and stored in a term-document matrix. For this, you can use feature_extraction.text.CountVectorizer.
5. Split the dataset into 80% for training and 20% for testing. For this, you must use train_test_split with the parameter random_state set to None.
6. Train a multinomial Naive Bayes Classifier (naive_bayes.MultinomialNB) on the training set using the default parameters and evaluate it on the test set.
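Steps 3 to 6 could be sketched roughly as follows. The helper load_corpus is a hypothetical name and is defined but not called here, because the location of the dataset on disk is not known; a tiny made-up corpus stands in for the loaded documents and labels so that the rest of the pipeline can run.

```python
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def load_corpus(container_path):
    """Read one sub-directory per class; labels come from the directory names."""
    return load_files(container_path, encoding="latin1")


# Hypothetical stand-in for corpus.data / corpus.target, for illustration only.
docs = [
    "shares fell as the market closed", "the bank raised interest rates",
    "profits rose at the company", "the market rallied on strong profits",
    "investors sold shares in the bank",
    "the team won the match", "the striker scored a late goal",
    "the coach praised the team", "the match ended in a draw",
    "the goal sealed the cup final",
]
labels = [0] * 5 + [1] * 5  # 0 = business, 1 = sport

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# 80% training, 20% testing; random_state=None gives a different split on each run.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=None
)

model = MultinomialNB()  # default parameters (alpha=1.0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

With the real data, docs and labels would be replaced by the data and target attributes of the object returned by load_files.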
7. In a file called bbc-performance.txt, save the following information: (to make it easier for the TAs, make sure that your output for each sub-question below is clearly marked in your output file, using the headings (a), (b) ...)
(a) a clear separator (a sequence of hyphens or stars) and a string clearly describing the model (e.g. “MultinomialNB default values, try 1”)
(b) the confusion matrix (you can use confusion_matrix)
(c) the precision, recall, and F1-measure for each class (you can use classification_report)
(d) the accuracy, macro-average F1 and weighted-average F1 of the model (you can use accuracy_score and f1_score)
(e) the prior probability of each class
(f) the size of the vocabulary (i.e. the number of different words[1])
(g) the number of word-tokens in each class (i.e. the number of words in total[2])
(h) the number of word-tokens in the entire corpus
(i) the number and percentage of words with a frequency of zero in each class
(j) the number and percentage of words with a frequency of zero in the entire corpus
(k) your 2 favorite words (that are present in the vocabulary) and their log-prob
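The quantities in (b) to (k) can be read from the fitted model and the vectorizer. The sketch below is one possible way to do this, under the following assumptions: the corpus is a made-up toy corpus, the model is evaluated on its own training data only to keep the example short, and the word "team" is an arbitrary choice for (k).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score)
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, for illustration only.
docs = ["the market fell", "the bank cut rates", "shares fell again",
        "the team won", "the striker scored", "the team lost again"]
y = np.array([0, 0, 0, 1, 1, 1])
class_names = ["business", "sport"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
model = MultinomialNB().fit(X, y)
y_pred = model.predict(X)  # training data reused here only for brevity

# (b)-(d) evaluation metrics
cm = confusion_matrix(y, y_pred)
report = classification_report(y, y_pred, target_names=class_names)
accuracy = accuracy_score(y, y_pred)
macro_f1 = f1_score(y, y_pred, average="macro")
weighted_f1 = f1_score(y, y_pred, average="weighted")

# (e) prior probability of each class, as estimated by the model
priors = np.exp(model.class_log_prior_)

# (f) size of the vocabulary
vocab_size = len(vectorizer.vocabulary_)

# (g)-(h) word-tokens per class and in total; feature_count_ holds the
# per-class word counts of the data the model was fitted on
counts_per_class = model.feature_count_
tokens_per_class = counts_per_class.sum(axis=1).astype(int)
total_tokens = int(X.sum())

# (i)-(j) words with a frequency of zero
zero_per_class = (counts_per_class == 0).sum(axis=1)
zero_pct_per_class = 100.0 * zero_per_class / vocab_size
corpus_counts = counts_per_class.sum(axis=0)
# With a vocabulary built from this same corpus, every word occurs at least once.
zero_in_corpus = int((corpus_counts == 0).sum())

# (k) log-probability of a chosen word in each class
word_index = vectorizer.vocabulary_["team"]
log_prob_team = model.feature_log_prob_[:, word_index]
```

Each value can then be written to the output file under its heading, for example with print(..., file=f).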
8. Redo steps 6 and 7 without changing anything (do not redo step 5, the dataset split). Change the model name to something like “MultinomialNB default values, try 2” and append the results to the file bbc-performance.txt.
9. Redo steps 6 and 7 again, but this time, change the smoothing value to 0.0001. Append the results at the end of bbc-performance.txt.
10. Redo steps 6 and 7, but this time, change the smoothing value to 0.9. Append the results at the end of bbc-performance.txt.
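In MultinomialNB, the smoothing value corresponds to the alpha parameter (additive smoothing). Steps 8 to 10 could be sketched as a loop that retrains on the same split and appends to the output file. The toy documents below are made up, and the file is written to a temporary directory so that the sketch does not touch real results.

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the fixed split from step 5.
train_docs = ["the market fell", "the bank cut rates",
              "the team won", "the striker scored"]
y_train = [0, 0, 1, 1]  # 0 = business, 1 = sport
test_docs = ["the bank fell", "the team scored"]
y_test = [0, 1]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)
team = vectorizer.vocabulary_["team"]  # "team" never occurs in class 0

# A temporary directory keeps the sketch from touching real output files.
out_path = os.path.join(tempfile.mkdtemp(), "bbc-performance.txt")

unseen_log_prob = {}
for alpha in (1.0, 0.0001, 0.9):
    model = MultinomialNB(alpha=alpha).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    unseen_log_prob[alpha] = model.feature_log_prob_[0, team]
    with open(out_path, "a") as f:  # "a" appends instead of overwriting
        f.write("-" * 40 + "\n")
        f.write(f"MultinomialNB, smoothing = {alpha}\n")
        f.write(f"accuracy: {accuracy:.3f}\n")
        f.write(f"log P(team | business): {unseen_log_prob[alpha]:.3f}\n")
```

The log-probability of a word that never occurs in a class is where the smoothing value has the most visible effect, which is why it is recorded here.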
11. In a separate plain text file called bbc-discussion.txt, explain in 1 to 2 paragraphs:
(a) what metric is best suited to this dataset/task and why (see step (2))
(b) why the performance in steps (8-10) is the same as, or different from, that of step (7) above.
In total, you should have 3 output files for task 1: bbc-distribution.pdf, bbc-performance.txt, and bbc-discussion.txt.
3 Task 2: Drug Classification
1. Download the Drug dataset on Moodle. This dataset, in csv format, contains features that are numerical, categorical and ordinal as well as one of 5 classes to predict: DrugA, DrugB, DrugC, DrugX, or DrugY.
2. Load the dataset in Python (you can use pandas.read_csv).
3. Plot the distribution of the instances in each class and store the graphic in a file called drug-distribution.pdf. You can use matplotlib.pyplot. This pre-analysis will allow you to determine if the classes are balanced, and which metric is more appropriate to use to evaluate the performance of your classifier.
4. Convert all ordinal and nominal features to numerical format. Make sure that your converted format respects the ordering of ordinal features, and does not introduce any ordering for nominal features. You may want to take a look at pandas.get_dummies and pandas.Categorical to do this.
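The conversion in step 4 could look like the sketch below. The column names (Age, Sex, BP) and the category values are assumptions made for illustration; the actual names in the Drug dataset may differ.

```python
import pandas as pd

# Hypothetical toy frame: column names and category values are assumptions.
df = pd.DataFrame({
    "Age": [23, 47, 61, 35],
    "Sex": ["F", "M", "M", "F"],              # nominal: no natural order
    "BP": ["HIGH", "LOW", "NORMAL", "HIGH"],  # ordinal: LOW < NORMAL < HIGH
})

# Ordinal feature: declare the order explicitly, then use the integer codes.
bp_order = ["LOW", "NORMAL", "HIGH"]
df["BP"] = pd.Categorical(df["BP"], categories=bp_order, ordered=True).codes

# Nominal feature: one indicator column per value, so no order is implied.
df = pd.get_dummies(df, columns=["Sex"])
```

Here BP becomes 0, 1 or 2 in the declared order, while Sex is replaced by the indicator columns Sex_F and Sex_M.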
5. Split the dataset with train_test_split, using the default parameter values.
6. Run 6 different classifiers:
(a) NB: a Gaussian Naive Bayes Classifier (naive_bayes.GaussianNB) with the default parameters.
(b) Base-DT: a Decision Tree (tree.DecisionTreeClassifier) with the default parameters.
(c) Top-DT: a better performing Decision Tree found using grid search (GridSearchCV). The grid search will allow you to find the best combination of hyper-parameters, as measured by the evaluation metric that you chose in step (3) above. The hyper-parameters that you will experiment with are:
• criterion: gini or entropy
• max_depth: 2 different values of your choice
• min_samples_split: 3 different values of your choice
(d) PER: a Perceptron (linear_model.Perceptron), with default parameter values.
(e) Base-MLP: a Multi-Layered Perceptron (neural_network.MLPClassifier) with 1 hidden layer of 100 neurons, sigmoid/logistic as activation function, stochastic gradient descent, and default values for the rest of the parameters.
(f) Top-MLP: a better performing Multi-Layered Perceptron found using grid search. For this, you need to experiment with the following parameter values:
• activation function: sigmoid, tanh, relu and identity
• 2 network architectures of your choice: e.g. 2 hidden layers with 30 + 50 nodes, or 3 hidden layers with 10 + 10 + 10
• solver: Adam and stochastic gradient descent
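The classifiers in step 6 could be set up roughly as follows. This is a sketch under several assumptions: scikit-learn's built-in iris data stands in for the Drug dataset, the numeric values in the decision tree grid are arbitrary choices, and macro-average F1 ("f1_macro") is used as the scoring function only as an example of a metric one might have chosen in step 3. The MLP search is constructed but not fitted here because it is slow.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in data, not the Drug dataset
X_train, X_test, y_train, y_test = train_test_split(X, y)  # default parameters

# Top-DT search space; the numeric values are arbitrary choices.
dt_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5],
    "min_samples_split": [2, 4, 8],
}
top_dt = GridSearchCV(DecisionTreeClassifier(), dt_grid, scoring="f1_macro")
top_dt.fit(X_train, y_train)
best_dt_params = top_dt.best_params_  # to be reported in the output file

# Top-MLP search space; "logistic" is scikit-learn's name for the sigmoid.
mlp_grid = {
    "activation": ["logistic", "tanh", "relu", "identity"],
    "hidden_layer_sizes": [(30, 50), (10, 10, 10)],
    "solver": ["adam", "sgd"],
}
top_mlp = GridSearchCV(MLPClassifier(), mlp_grid, scoring="f1_macro")
# top_mlp.fit(X_train, y_train) would run the search; omitted here because it is slow.

models = {
    "NB": GaussianNB(),
    "Base-DT": DecisionTreeClassifier(),
    "Top-DT": top_dt.best_estimator_,  # already refitted on the training set
    "PER": Perceptron(),
    "Base-MLP": MLPClassifier(
        hidden_layer_sizes=(100,), activation="logistic", solver="sgd"
    ),
}
```

Once fitted, top_mlp.best_estimator_ can be added to the dictionary in the same way as Top-DT.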
7. For each of the 6 classifiers above, append the following information in a file called drugs-performance.txt: (to make it easier for the TAs, make sure that your output for each sub-question below is clearly marked in your output file, using the headings (a), (b) ...)
(a) a clear separator (a sequence of hyphens or stars) and a string clearly describing the model (e.g. the model name + hyper-parameter values that you changed). In the case of Top-DT and Top-MLP, display the best hyperparameters found by the gridsearch.
(b) the confusion matrix
(c) the precision, recall, and F1-measure for each class
(d) the accuracy, macro-average F1 and weighted-average F1 of the model
8. Redo step 6 10 times for each model and append the average accuracy, average macro-average F1, and average weighted-average F1, as well as the standard deviation of the accuracy, the standard deviation of the macro-average F1, and the standard deviation of the weighted-average F1, at the end of the file drugs-performance.txt. Does the same model give you the same performance every time? Explain in a plain text file called drugs-discussion.txt. A 1 or 2 paragraph discussion is expected.
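The repeated runs in step 8 could be organised as in the sketch below. The function name repeated_scores is hypothetical, iris again stands in for the Drug dataset, and GaussianNB is used only as an example; any of the 6 models can be passed in the same way.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)  # stand-in data, not the Drug dataset
X_train, X_test, y_train, y_test = train_test_split(X, y)


def repeated_scores(make_model, n_runs=10):
    """Train a fresh model n_runs times on the same split; return (mean, std) per metric."""
    scores = {"accuracy": [], "macro_f1": [], "weighted_f1": []}
    for _ in range(n_runs):
        model = make_model().fit(X_train, y_train)
        y_pred = model.predict(X_test)
        scores["accuracy"].append(accuracy_score(y_test, y_pred))
        scores["macro_f1"].append(f1_score(y_test, y_pred, average="macro"))
        scores["weighted_f1"].append(f1_score(y_test, y_pred, average="weighted"))
    # np.std defaults to the population standard deviation (ddof=0).
    return {name: (float(np.mean(v)), float(np.std(v))) for name, v in scores.items()}


summary = repeated_scores(GaussianNB)
```

Passing a callable such as GaussianNB, rather than an already fitted model, ensures that each run starts from a freshly initialised classifier.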
In total, you should have 3 output files for task 2: drug-distribution.pdf, drugs-performance.txt, and drugs-discussion.txt.