1 Decision Trees
In this exercise you are to implement a decision tree classifier. You are not allowed to use an implementation of decision tree classifiers from libraries such as sklearn. You are not allowed to copy code you have found online. Read the entire exercise before writing any code or answering any questions.
a) Implement a decision tree that can support categorical variables. This algorithm is described in Figure 18.5 in AIMA. Use information gain as the Importance function. Use your implementation to train a decision tree on the Titanic dataset (see below). Some of the columns in the dataset are continuous attributes and cannot be used with this implementation. Which columns are these? Test your model on the test set and print your accuracy. Add your decision tree to the PDF manually, using Graphviz, or with some other software.
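As a starting point, here is a minimal sketch of the Importance function based on information gain, assuming the data is loaded into a pandas DataFrame; the names entropy, information_gain, and target are illustrative, not a required interface:

import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """Shannon entropy of a label distribution, in bits."""
    probs = labels.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def information_gain(data: pd.DataFrame, attribute: str, target: str) -> float:
    """Expected reduction in entropy from splitting data on attribute."""
    before = entropy(data[target])
    # Weighted average entropy of the subsets produced by the split.
    after = sum(
        len(subset) / len(data) * entropy(subset[target])
        for _, subset in data.groupby(attribute)
    )
    return before - after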
b) Extend your implementation to also support continuous variables. Train this model on the Titanic dataset again, now also using the continuous attributes you left out in a). Test the model again on the test set and print the accuracy of the predictions. Add your decision tree to the PDF manually, using Graphviz, or with some other software.
c) Discuss the results from a) and b): is the performance difference what you expected? What might be done to improve the performance of the decision tree classifier on the test set? Suggest at least two changes that could be made to the algorithm to improve performance, and argue why your changes could improve it. You do not have to show that your changes actually improve accuracy on the Titanic dataset.
Continuous variables
Splitting on continuous variables is different from splitting on categorical variables. You cannot make a branch for every possible value; instead, you find the best value to split on and make two branches: one for values smaller than the split and one for values larger than it. Because each split on a continuous attribute only produces two branches, decision trees are typically allowed to split on the same continuous attribute multiple times. For your implementation in b), we only require you to split on each continuous attribute once, just like with the categorical attributes. This ensures that the tree never becomes deeper than the number of attributes in the data.
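The threshold search could look something like the sketch below, which reuses the entropy helper from the earlier sketch. Taking candidate thresholds at the midpoints between consecutive distinct values is one common choice, since information gain can only change where the sorted attribute values change:

import pandas as pd

def best_threshold(data: pd.DataFrame, attribute: str, target: str):
    """Return (threshold, gain) for the best binary split on a continuous attribute."""
    values = data[attribute].dropna().sort_values().unique()
    total = entropy(data[target])
    best_split, best_gain = None, -1.0
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = data[data[attribute] <= threshold]
        right = data[data[attribute] > threshold]
        # Weighted average entropy of the two branches.
        weighted = (
            len(left) / len(data) * entropy(left[target])
            + len(right) / len(data) * entropy(right[target])
        )
        gain = total - weighted
        if gain > best_gain:
            best_split, best_gain = threshold, gain
    return best_split, best_gain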
Titanic dataset
The Titanic dataset contains information on passengers aboard RMS Titanic. The idea is to predict the survival of each passenger based on other collected information.
The dataset contains the following columns:
• Survival: Did the passenger survive? (0=No, 1=Yes)
• Pclass: Ticket class (1=1st, 2=2nd, 3=3rd)
• Name: The name of the passenger
• Sex: Sex of the passenger
• Age: Age in years
• SibSp: Number of siblings/spouses aboard the Titanic
• Parch: Number of parents/children aboard the Titanic
• Ticket: Ticket number
• Fare: Passenger fare
• Cabin: Cabin number
• Embarked: Port of embarkation (C=Cherbourg, Q=Queenstown, S=Southampton)
Not all columns are suitable predictors, however. Use only the columns you find suitable for predicting survival, and explain why you disregarded the other attributes.
Data is split into two files:
• Training set (train.csv)
• Test set (test.csv)
The training set includes all columns listed above. The test set predictors include all columns except Survival, and the test set labels contain only Survival for the individuals in the test data. Furthermore, some of the columns contain missing values. You do not need to implement support for missing values, so you can disregard these attributes when training your decision tree.
We have given you a modified version of the Titanic dataset available at https://www.kaggle.com/c/titanic/data.
Tips
Use a dataframe library to load the csv files; Pandas is a great option if you are using Python. Pandas methods that can come in handy when implementing the decision tree include DataFrame.value_counts, DataFrame.groupby, DataFrame.map, and DataFrame.sort_values.
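For example, assuming the csv files use the column names from the original Kaggle data (where the target column is called Survived; adjust if the modified files differ), loading and exploring the data might look like this:

import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Class balance of the target.
print(train["Survived"].value_counts())

# Survival rate per ticket class.
print(train.groupby("Pclass")["Survived"].mean())

# Map a categorical attribute to numeric codes if your tree expects numbers.
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})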
2 Missing Values
One of the columns in the Titanic dataset contained missing values. We chose not to use this column at all when we implemented our decision tree, even though only very few values were missing. This means that we missed out on a lot of information that could have made our decision tree perform better. Suggest at least two methods that we could have used to handle the missing data, either during training or during prediction. Describe your proposed solutions either in text or with pseudo code. Discuss the advantages and disadvantages of your methods. What assumptions are you making?
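As an illustration of the kind of pseudo code that would be acceptable (not the expected answer), here is a sketch of one possible method, mode imputation; Embarked is used purely as a hypothetical example of a column with few missing values:

import pandas as pd

def impute_mode(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Fill missing values in a column with its most frequent value."""
    df = df.copy()
    df[column] = df[column].fillna(df[column].mode().iloc[0])
    return df

# Example: impute before training; reuse the same fill value at prediction time.
train = impute_mode(pd.read_csv("train.csv"), "Embarked")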