Introduction to Machine Learning Program Assignment #2 -
Decision Tree, Random Forest & KNN
This programming assignment requires you to understand and implement Decision Tree, Random Forest, and K-Nearest Neighbor algorithms.
Before we start
You may choose to go to the PC classrooms or finish this HW elsewhere.
It won’t affect a thing. For fairness’ sake, we’ll use Discord as the Q&A system.
Join the Discord server for TA support
This is the same one as the program assignment #1 server.
Ask questions on it, and we shall reply. (We won’t respond to raised hands.)
Try not to ask for obvious answers or bug fixes.
Memes and chit-chat are welcome
Objective
Data Input - 5%
Data Preprocessing - 5%
Transform the data format and shape so your model can process them.
Shuffle the data.
Transform the label format so you can perform the two required tasks described in the Data section below.
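As a minimal sketch of these preprocessing steps (assuming pandas; the function name `shuffle_and_split` and the seed are our own choices, not part of the assignment):

```python
import pandas as pd

def shuffle_and_split(df, seed=42):
    """Shuffle rows, separate features from the G3 label, and split
    the feature columns into categorical vs. numerical groups."""
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    X, y = df.drop(columns=["G3"]), df["G3"]
    cat_cols = X.select_dtypes(include="object").columns.tolist()
    num_cols = X.select_dtypes(exclude="object").columns.tolist()
    return X, y, cat_cols, num_cols
```

Note that the UCI student files are semicolon-separated, so load them with `pd.read_csv("student-mat.csv", sep=";")`.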
Model Construction - 30%
For all the models, you need to perform the two tasks described in the Data section below.
The data consist of both categorical and numerical features, and you have to treat them differently.
Three models must be constructed: Decision Tree, Random Forest, and K-Nearest Neighbor.
For the Decision Tree model, you may use the following ID3 algorithm pseudocode. - 10%
ID3 (Examples, Target_Attribute, Attributes)
    Create a root node Root for the tree.
    If all examples are positive, return the single-node tree Root, with label = +.
    If all examples are negative, return the single-node tree Root, with label = -.
    If the set of predicting attributes is empty, then return the single-node tree Root,
    with label = most common value of the target attribute in the examples.
    Otherwise begin
        A ← the attribute that best classifies the examples.
        Decision tree attribute for Root = A.
        For each possible value, vi, of A,
            Add a new tree branch below Root, corresponding to the test A = vi.
            Let Examples(vi) be the subset of examples that have the value vi for A.
            If Examples(vi) is empty
                Then below this new branch add a leaf node with label = most common target value in the examples.
            Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes – {A}).
    End
    Return Root
Note that you may implement any decision tree algorithm; you are not restricted to ID3. However, you must state which algorithm you used.
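The pseudocode above can be sketched in Python for purely categorical attributes. This is only an illustrative sketch: it assumes entropy/information gain as the "best classifies" criterion (ID3's usual choice), and all helper names (`entropy`, `best_attribute`, `id3`, `predict`) are ours. Numerical features would still need discretization or a different split rule.

```python
import numpy as np
import pandas as pd

def entropy(y):
    """Shannon entropy of a label Series."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_attribute(X, y, attrs):
    """Attribute with the highest information gain."""
    base = entropy(y)
    gains = {}
    for a in attrs:
        g = base
        for v, idx in X.groupby(a).groups.items():
            sub = y.loc[idx]
            g -= len(sub) / len(y) * entropy(sub)
        gains[a] = g
    return max(gains, key=gains.get)

def id3(X, y, attrs):
    """Return a nested dict {attr: {value: subtree_or_leaf}} or a leaf label."""
    if y.nunique() == 1:          # all examples share one label
        return y.iloc[0]
    if not attrs:                 # no attributes left: majority label
        return y.mode()[0]
    a = best_attribute(X, y, attrs)
    node = {a: {}}
    for v, idx in X.groupby(a).groups.items():
        node[a][v] = id3(X.loc[idx], y.loc[idx], [b for b in attrs if b != a])
    return node

def predict(tree, row):
    """Walk the tree with a dict-like row; unseen values return None."""
    while isinstance(tree, dict):
        a = next(iter(tree))
        tree = tree[a].get(row[a])
    return tree
```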
For the Random Forest model, you must construct multiple Decision Tree models from randomly selected data (from the training subset) and perform voting for prediction. - 10%
For the data selection, any of the following methods is acceptable; choose one to implement:
Randomly select features
Randomly select samples
Both
The number of trees must be greater than or equal to 3. You need to try at least 3 different numbers of trees and compare the results.
Understand the difference between K-fold cross-validation and Random Forest. If you confuse one with the other, you will not receive points for this part.
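A minimal sketch of the "randomly select samples" variant (bagging plus voting). Since package-provided models are allowed, this sketch uses scikit-learn's `DecisionTreeClassifier` as a stand-in for your own tree; the function names are ours:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier  # stand-in for your own tree

def fit_forest(X, y, n_trees=5, seed=0):
    """Bagging: each tree is trained on a bootstrap sample
    (rows drawn with replacement from the training subset)."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), len(X))   # sample rows with replacement
        forest.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return forest

def predict_forest(forest, X):
    """Majority vote over the per-tree predictions."""
    votes = np.array([t.predict(X) for t in forest])
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```

Note how this differs from K-fold cross-validation: here every tree draws a random sample from the same training subset, whereas K-fold partitions the data into disjoint folds to estimate performance.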
For the KNN model, you need to encode categorical features so that distances can be computed. And you may need to normalize every feature for your KNN model to work as expected. - 10%
You need to try at least 3 different K values and compare their results.
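One common way to handle both points, sketched below: one-hot encode the categorical features and min-max normalize the numerical ones, then use plain Euclidean distance with majority voting. The helper names are ours, and one-hot encoding is just one acceptable choice:

```python
import numpy as np
import pandas as pd
from collections import Counter

def prepare_features(X, num_cols):
    """One-hot encode categoricals and min-max normalize numerics so that
    no single feature dominates the Euclidean distance."""
    X = pd.get_dummies(X)                    # categorical -> 0/1 indicator columns
    for c in num_cols:
        lo, hi = X[c].min(), X[c].max()
        X[c] = (X[c] - lo) / (hi - lo) if hi > lo else 0.0
    return X.to_numpy(dtype=float)

def knn_predict(X_train, y_train, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]
```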
Validation - 5%
Two validation methods need to be implemented.
Holdout validation with a 7:3 ratio
K-fold cross-validation with K = 3
Obtain the final performance by averaging all folds’ performance.
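Both splits can be sketched as index generators (numpy assumed; function names ours). Train on the train indices, evaluate on the validation indices, and for K-fold average the per-fold scores:

```python
import numpy as np

def holdout_split(n, ratio=0.7, seed=0):
    """Shuffle indices, then cut once at the given train ratio (7:3 here)."""
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(n * ratio)
    return idx[:cut], idx[cut:]

def kfold_splits(n, k=3, seed=0):
    """Yield (train_idx, val_idx) pairs; each fold serves as the
    validation set exactly once."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]
```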
Results - 10%
Report the performance of all experimental settings in tables using the following metrics:
Confusion matrix
Accuracy
Sensitivity (Recall)
Precision
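The three scalar metrics can all be read off the confusion matrix. A minimal sketch (function names ours; the binary helper assumes the label order [positive, negative]):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, labels):
    """Rows = true label, columns = predicted label."""
    pos = {l: i for i, l in enumerate(labels)}
    m = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[pos[t], pos[p]] += 1
    return m

def binary_scores(cm):
    """Accuracy, recall (sensitivity), precision from a 2x2 matrix
    whose label order is [positive, negative]."""
    tp, fn, fp, tn = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
    accuracy  = (tp + tn) / cm.sum()
    recall    = tp / (tp + fn)
    precision = tp / (tp + fp)
    return accuracy, recall, precision
```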
Comparison & Conclusion - 5%
Also include some feedback: anything you want to tell us.
Questions - 40%
Decision Tree
Show the prediction and the reasoning for 1 sample in the validation set. - 10%
Random Forest
Describe the difference between boosting and bagging. - 10%
KNN
Pick 2 features, draw and describe the KNN decision boundaries. - 10%
You can pick 2 features to re-train the model, or simply fix the values of every other feature.
Show the prediction and the reasoning for 1 sample in the validation set. - 10%
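For the decision-boundary question, a common approach is to predict every point on a dense 2-D grid and color the plane. The sketch below assumes scikit-learn's `KNeighborsClassifier` (or any model with a `predict` method) and matplotlib; the function name and axis labels are placeholders for your two chosen features:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # render to file, no display needed
import matplotlib.pyplot as plt

def plot_knn_boundary(model, X2, y, steps=200, path="knn_boundary.png"):
    """X2: (n, 2) array of the two chosen (already normalized) features.
    Colors each grid point by the model's prediction, then overlays the data."""
    x0 = np.linspace(X2[:, 0].min() - 0.1, X2[:, 0].max() + 0.1, steps)
    x1 = np.linspace(X2[:, 1].min() - 0.1, X2[:, 1].max() + 0.1, steps)
    xx, yy = np.meshgrid(x0, x1)
    zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, zz, alpha=0.3)                 # decision regions
    plt.scatter(X2[:, 0], X2[:, 1], c=y, edgecolors="k")  # the actual samples
    plt.xlabel("feature 1")
    plt.ylabel("feature 2")
    plt.savefig(path)
    plt.close()
```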
Finish during class - 20%
Submit your report and source codes to the newE3 system before class ends.
Finish time will be determined by the submission time.
Data - Student Performance Data Set
Data can be downloaded here:
http://archive.ics.uci.edu/ml/datasets/Student+Performance
Please NOTE that the last column is the label (G3)
Two datasets are provided (Mathematics, Portuguese language); either is acceptable. You may choose one to analyze.
Following this paper, you will have to perform 2 classification tasks:
Binary classification - pass if G3 ≥ 10, else fail
5-Level classification - based on the Erasmus grade conversion system.
| Country | I (excellent/very good) | II (good) | III (satisfactory) | IV (sufficient) | V (fail) |
|---|---|---|---|---|---|
| Portugal/France | 16-20 | 14-15 | 12-13 | 10-11 | 0-9 |
| Ireland | A | B | C | D | F |
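The label transform for both tasks follows directly from the threshold and the conversion table above (Portugal/France row). A minimal sketch, with function names of our own choosing:

```python
def binary_label(g3):
    """Task 1: pass/fail at the G3 >= 10 threshold."""
    return "pass" if g3 >= 10 else "fail"

def erasmus_level(g3):
    """Task 2: 5-level label from the Portugal/France grade ranges."""
    if g3 >= 16:
        return "I"
    if g3 >= 14:
        return "II"
    if g3 >= 12:
        return "III"
    if g3 >= 10:
        return "IV"
    return "V"
```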
Data Set Information
These data address student achievement in secondary education at two Portuguese schools. The data attributes include student grades, demographic, social and school-related features, and they were collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see the paper source for more details).
Attribute Information
school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
age - student’s age (numeric: from 15 to 22)
address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support (binary: yes or no)
famsup - family educational support (binary: yes or no)
paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities - extra-curricular activities (binary: yes or no)
nursery - attended nursery school (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
internet - Internet access at home (binary: yes or no)
romantic - with a romantic relationship (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences (numeric: from 0 to 93)
G1 - first period grade (numeric: from 0 to 20)
G2 - second period grade (numeric: from 0 to 20)
G3 - final grade (numeric: from 0 to 20, output target)
Submission & Scoring Policy
Please submit a zip file, which contains the following, to the newE3 system.
Report
Explanation of how your code works.
All the content mentioned above.
Your name and student ID at the very beginning - 10%
Accepted formats: HTML
Source codes
Accepted languages: python3
Accepted formats: .ipynb
Package-provided models are allowed
Your score will be determined mainly by the submitted report.
If there’s any problem with your code, the TA might ask you (through email) to demo it. Otherwise, no demo is needed.
Scores will be adjusted at the end of the semester so that they fit the school regulations.
Plagiarizing is not allowed.
You will get ZERO on this homework if you are caught the first time.
The second time, you’ll FAIL this class.
Acknowledgments
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Tools that might be useful
Jupyter Lab, pre-installed in PC classrooms
Numpy - Math thingy
matplotlib - Plot thingy
pandas - Data thingy
scipy - Science thingy
scikit-learn - Machine Learning and stuff