The goal of this assignment is to familiarize you with classification systems in general and with decision tree classifiers in particular.
### What to do -- decision_tree.py
You are asked to write a Python program, called `decision_tree.py`, that will:
1. read a decision tree (stored in a plain text file),
2. read a test data set (stored in a csv file, with the first row having the variable names), and
3. evaluate the test data using the provided decision tree and provide statistics.
Your program should be invoked as:
```
python3 decision_tree.py tree.txt test.csv
```
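As a minimal sketch of handling this invocation, the two file paths could be read with the standard `argparse` module. The function name `parse_args` and the argument names `tree_file` and `test_file` are illustrative choices, not requirements of the assignment.

```python
import argparse


def parse_args(argv=None):
    """Return (tree_path, test_path) from the command line arguments."""
    parser = argparse.ArgumentParser(
        description="Evaluate test data against a decision tree."
    )
    parser.add_argument("tree_file", help="plain-text decision tree, e.g. tree.txt")
    parser.add_argument("test_file", help="CSV test data, e.g. test.csv")
    args = parser.parse_args(argv)
    return args.tree_file, args.test_file


# Passing an explicit list mimics: python3 decision_tree.py tree.txt test.csv
tree_path, test_path = parse_args(["tree.txt", "test.csv"])
print(tree_path, test_path)
```

Calling `parse_args()` with no argument makes `argparse` read `sys.argv`, which is what a real run of the program would use.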
### (1) Decision Tree Format
The decision tree will be provided as a text file and will essentially be the output from the ID3 Decision Tree Classifier.
A sample decision tree is provided below and also included as file `tree.txt` within this repository:
```
color black: bad (2)
color blue
| fruit blueberries: good (2)
| fruit grapes: bad (1)
color green
| fruit blueberries: bad (2)
| fruit grapes: good (2)
color red
| fruit blueberries: bad (1)
| fruit grapes: good (1)
```
The above was generated by the `treegen.py` program which is also included within this repository. The format is fairly straightforward: the above tree corresponds to a two-level decision tree, with `color` being the first variable (valid options: `black`, `blue`, `green`, and `red`) and `fruit` the second variable (valid options: `blueberries` and `grapes`). There are only two labels: `good` and `bad`. The numbers in parentheses denote how many samples each rule was built upon.
Your program **should handle decision trees up to 3 levels deep**.
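One possible way to read this format, shown only as a sketch, is to treat the number of `|` characters on a line as its depth and to keep the list of conditions on the current branch. The helper name `parse_tree` and the rule representation (a tuple of `(variable, value)` pairs plus a label) are assumptions made for illustration. The sketch also assumes that variable names contain no spaces and that values contain neither `|` nor `:`.

```python
SAMPLE_TREE = """\
color black: bad (2)
color blue
| fruit blueberries: good (2)
| fruit grapes: bad (1)
color green
| fruit blueberries: bad (2)
| fruit grapes: good (2)
color red
| fruit blueberries: bad (1)
| fruit grapes: good (1)
"""


def parse_tree(lines):
    """Turn tree lines into a list of (conditions, label) rules."""
    rules = []
    path = []  # conditions on the branch currently being read
    for line in lines:
        if not line.strip():
            continue
        depth = line.count("|")              # one "|" per nesting level
        body = line.replace("|", "").strip()
        head, sep, tail = body.partition(":")
        var, value = head.split(None, 1)     # "color black" -> "color", "black"
        del path[depth:]                     # drop conditions from deeper branches
        path.append((var, value.strip()))
        if sep:                              # a ":" marks a leaf, i.e. a full rule
            label = tail.split("(")[0].strip()
            rules.append((tuple(path), label))
    return rules


rules = parse_tree(SAMPLE_TREE.splitlines())
for conditions, label in rules:
    print(conditions, "->", label)
```

Because the depth is read from the line itself, the same loop works for one, two, or three levels without special cases.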
You are encouraged to experiment with the `decision-tree-id3` module (https://svaante.github.io/decision-tree-id3/index.html), which the `treegen.py` program uses, while preparing your assignment. However, you are **not allowed to use the decision-tree-id3 module in your submission**.
### (2) Test Data Format
The test data set will be provided as a CSV file. The first row will contain the variable names. A sample test data file, named `test.csv`, is provided within this repository. The first 3 lines of the file are shown below:
```
"day_of_week", "fruit", "color"
"mon", "blueberries", "black"
"mon", "blueberries", "blue"
```
Please note that the test data set contains at least as many variables as the decision tree file specifies, so it may include variables that the tree never uses. In this example, `day_of_week` is not part of the decision tree.
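The sample above has a space after each comma, before the opening quote. As a hedged sketch, the standard `csv` module can read such a file if `skipinitialspace=True` is set; without it, the quotes would be kept as part of each value. The helper name `read_rows` is illustrative, and the in-memory sample stands in for the real `test.csv`.

```python
import csv
import io


def read_rows(fileobj):
    """Read CSV rows into dictionaries keyed by the header names."""
    # skipinitialspace=True ignores the blank that follows each comma,
    # so the quoted fields are parsed as quoted fields.
    reader = csv.DictReader(fileobj, skipinitialspace=True)
    return [dict(row) for row in reader]


sample = io.StringIO(
    '"day_of_week", "fruit", "color"\n'
    '"mon", "blueberries", "black"\n'
    '"mon", "blueberries", "blue"\n'
)
rows = read_rows(sample)
print(rows[0])
```

Keying each row by variable name means the column order in the CSV file does not need to match the order of variables in the tree, and extra columns such as `day_of_week` are simply never looked up.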
### (3) How to evaluate the decision tree
Given the decision tree and the test data input files, you are asked to do two things:
1. for each row in the test data set, find which rule from the decision tree it will match against, and
2. keep track of how many times each rule in the decision tree was matched, and print these statistics.
Your program should print only these statistics, for all rules, in the same format as the decision tree file: each rule keeps its text, and the number in parentheses becomes the number of test rows that matched it.
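The two steps above can be sketched as a loop that tests each row against the rules in order and counts the first rule whose conditions all hold. This is one possible approach, not the required one. The rule representation (a tuple of `(variable, value)` pairs plus a label), the function name `count_matches`, and the example rows are all made up for illustration.

```python
def count_matches(rules, rows):
    """Return (per-rule match counts, number of unmatched rows)."""
    counts = [0] * len(rules)
    unmatched = 0
    for row in rows:
        for index, (conditions, _label) in enumerate(rules):
            # A rule matches when every (variable, value) pair agrees with the row.
            if all(row.get(var) == value for var, value in conditions):
                counts[index] += 1
                break
        else:
            # The inner loop finished without a break: no rule matched.
            unmatched += 1
    return counts, unmatched


example_rules = [
    ((("color", "black"),), "bad"),
    ((("color", "blue"), ("fruit", "blueberries")), "good"),
    ((("color", "blue"), ("fruit", "grapes")), "bad"),
]
example_rows = [
    {"day_of_week": "mon", "fruit": "grapes", "color": "black"},
    {"day_of_week": "tue", "fruit": "grapes", "color": "blue"},
    {"day_of_week": "wed", "fruit": "grapes", "color": "purple"},
]
counts, unmatched = count_matches(example_rules, example_rows)
print(counts, unmatched)
```

Since the branches of a decision tree are mutually exclusive, at most one rule should match any row, so stopping at the first match does not lose information.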
For example, the correct output for running your program with the provided `tree.txt` and `test.csv` files should be the following (included in the repository as `output.txt`):
```
color black: bad (6)
color blue
| fruit blueberries: good (3)
| fruit grapes: bad (2)
color green
| fruit blueberries: bad (4)
| fruit grapes: good (5)
color red
| fruit blueberries: bad (2)
| fruit grapes: good (4)
UNMATCHED: 1
```
Note that if the test data contain rows that no decision tree rule matched, you must add a final line of the form `UNMATCHED: n`, where `n` is the number of such rows.
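As a rough sketch of producing this output, the original tree lines can be reused and only the number in parentheses swapped for the new count. The function name `format_report` is hypothetical, and the sketch assumes that leaf counts are supplied in the same top-to-bottom order as the leaf lines. It also assumes that the `UNMATCHED` line is omitted when every row matched, which is one reading of the note above.

```python
import re


def format_report(tree_lines, leaf_counts, unmatched):
    """Rebuild the tree text with match counts in place of sample counts."""
    out = []
    counts = iter(leaf_counts)
    for line in tree_lines:
        if ":" in line:  # leaf lines carry "label (n)"
            line = re.sub(r"\(\d+\)\s*$", f"({next(counts)})", line)
        out.append(line)
    if unmatched:  # assumption: no UNMATCHED line when the count is zero
        out.append(f"UNMATCHED: {unmatched}")
    return "\n".join(out)


report = format_report(
    ["color black: bad (2)", "color blue", "| fruit blueberries: good (2)"],
    [6, 3],
    1,
)
print(report)
```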
**Important Hint** To solve this assignment, you are strongly encouraged to read the documentation for Python's built-in `exec()` function:
https://docs.python.org/3/library/functions.html#exec
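One way this hint might be applied, offered as a sketch rather than the intended solution, is to translate the tree into the source code of a Python function made of nested `if` statements and then run that source with `exec()`. The names `tree_to_source` and `classify`, and the idea of incrementing a per-leaf counter, are assumptions for illustration. The sketch assumes that every line without a `:` is followed by at least one deeper line, since an `if` with an empty body would not compile.

```python
def tree_to_source(tree_lines):
    """Return (Python source for classify(row, counts), number of leaves)."""
    src = ["def classify(row, counts):"]
    leaf = 0
    for line in tree_lines:
        if not line.strip():
            continue
        depth = line.count("|")
        body = line.replace("|", "").strip()
        head, sep, _ = body.partition(":")
        var, value = head.split(None, 1)
        indent = "    " * (depth + 1)
        # repr() quotes the strings so they become valid Python literals.
        src.append(f"{indent}if row.get({var!r}) == {value.strip()!r}:")
        if sep:  # leaf: count the match and stop
            src.append(f"{indent}    counts[{leaf}] += 1")
            src.append(f"{indent}    return True")
            leaf += 1
    src.append("    return False")
    return "\n".join(src), leaf


TREE_LINES = [
    "color black: bad (2)",
    "color blue",
    "| fruit blueberries: good (2)",
    "| fruit grapes: bad (1)",
]
ROWS = [
    {"color": "black", "fruit": "grapes"},
    {"color": "blue", "fruit": "grapes"},
    {"color": "red", "fruit": "grapes"},
]

source, n_leaves = tree_to_source(TREE_LINES)
namespace = {}
exec(source, namespace)          # defines classify() inside namespace
classify = namespace["classify"]

counts = [0] * n_leaves
results = [classify(row, counts) for row in ROWS]
print(source)
print(counts, results)
```

Keep in mind that `exec()` runs whatever code it is given, so this approach is only reasonable when the tree file comes from a trusted source.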
### Important: special-cases.txt
If you do something in your code that you would consider a special case, you are requested to submit an extra file along with your submission, named `special-cases.txt`, in which you describe in plain text what each special case is and how your program handles it. We will use this mechanism instead of answering such questions on Piazza.