
CS 434 Assignment 3: Decision Tree Ensemble for Predicting Election Results by US County Statistics


In this assignment we will classify United States counties as Democratic- or Republican-majority in the 2016 US election. We do this based on a set of discrete features related to each county. These features are a categorical representation of continuous features, which are described in the data description (county dictionary) file. The goal is to develop variations of the decision tree algorithm and ensemble methods.

Data. The data for this assignment is taken from a collection of US election data from the 2016 primary elections. The data has already been preprocessed for you. Here is a short description of the train and test splits:

(a)    Train Set Features/Labels (x_train.csv, y_train.csv): Includes 2098 rows (samples, one for each county) in each file. Each sample in X contains 51 categorical features corresponding to bins of continuous data encompassing various statistics of a given county. If you're curious, I can provide what the bins are, but it's not necessary for the assignment. Just think "quantiles" of some sort and you'll be fine understanding it.

(b)    Test Set Features/Labels (x_test.csv, y_test.csv): Includes 700 rows. Each row follows the same format as the train set. This set will be used to measure the performance of the models.

Important Guidelines. For all parts of this assignment:

(a)    Labels are already assigned: class 1 is Republican and class 0 is Democrat. Be sure they aren't modified when making predictions, as the starter code assumes 0/1 labels, not -1/+1 labels.

(b)    Please do not add bias to the features.

(c)    Please use the base classes provided for your implementations and modifications. You may add more functions if needed, but ensure the class structure preserves the format in the starter code.

(d)    NOTE: This data is class imbalanced. That is, we have about a 2:1 ratio of positive to negative class. This will affect how we quantify predictions, as we may need to use more than just raw accuracy for model selection. You will be asked to explain your intuition behind why this is the case in the assignment.



Part 1 (20 pts): Decision Tree (DT). For this part we are interested in using a decision tree with the following configuration:

•    The DT uses the gini-index to measure uncertainty. Specifically, if a node splits a training list A into left and right lists $A_L$ and $A_R$, as depicted in Figure 1, then the benefit of the split for feature $f_i$ tested against value $v$ is computed as:

$$B = U(A) - p_l\,U(A_L) - p_r\,U(A_R) \qquad (1)$$

[Figure 1: Split according to feature $f_i$ testing against value $v$]

where $U$ is the uncertainty measure. For this assignment we will use the gini-index, which is computed for a list such as $A_L$ as follows:

$$U(A_L) = 1 - \sum_{c \in \{0,1\}} p_c^2 \qquad (2)$$

where $p_c$ is the fraction of examples in $A_L$ belonging to class $c$, and $p_l$ and $p_r$ are the proportions of examples sent to each split, given by:

$$p_l = \frac{|A_L|}{|A|}, \qquad p_r = \frac{|A_R|}{|A|} \qquad (3)$$

(A minimal code sketch of this computation appears after this list.)

•    The features are categorical with more than 2 possible values in most cases. So a feature might be tested multiple times against different values in the tree. Most of the logic for this is done for you, but there are a few things you will need to do to get the basic decision tree working.
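For reference, here is a minimal sketch of how Eqs. (1)-(3) translate to code, assuming the labels form a 0/1 NumPy array; the function names are illustrative, not the starter code's:

    import numpy as np

    def gini(y):
        """Gini impurity U(A) = 1 - sum_c p_c^2 for a 0/1 label array (Eq. 2)."""
        if len(y) == 0:
            return 0.0
        p = np.mean(y)                      # fraction of class-1 samples
        return 1.0 - p ** 2 - (1.0 - p) ** 2

    def split_benefit(y, y_left, y_right):
        """Benefit of a split, B = U(A) - p_l U(A_L) - p_r U(A_R) (Eq. 1)."""
        p_l = len(y_left) / len(y)          # Eq. (3)
        p_r = len(y_right) / len(y)
        return gini(y) - p_l * gini(y_left) - p_r * gini(y_right)

    # Example: benefit of testing feature i against value v.
    # mask = X[:, i] == v
    # B = split_benefit(y, y[mask], y[~mask])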

Please implement the following steps:

(a)    Please read through the starter code provided and work to understand how the recursive process is implemented for building a decision tree. I have chosen to use a fairly common, minimal approach for creating the trees. You will need to know this for modifying the classes for both Random Forest and AdaBoost.

(b)    Add your code into the specified sections of the DecisionTreeClassifier class. You should only need to handle the Gini Impurity calculation (as shown above). Your function should calculate all the impurities and return the gain.

(c)    The base function included in main.py can be used to test your implementation. (Note: there is one included for Random Forest as well; feel free to make a similar one for AdaBoost when you implement it.)

(d)    Now, add a function in main for creating trees with depths ranging from 1 to 25 (inclusive). Plot and explain the behavior of train/test performance against the depth. That is, plot accuracy versus depth, as well as F1 score versus depth. (F1 score is the harmonic mean of precision and recall; I believe you've covered precision/recall already.) A rough sketch of this sweep appears after this list.

(e)    Report the depth that gives the best test accuracy. You will probably hit a point where the tree does not need to be deeper, and it's best to use the simplest version.

(f)    What is the most important feature for making a prediction? How can we find it? Report its name (see the data dictionary) as well as the value it takes for the split.
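As a rough sketch for step (d), assuming the starter code's DecisionTreeClassifier exposes fit/predict with a max_depth constructor argument, and that x_train, y_train, x_test, y_test are already loaded (all of these names are assumptions), the depth sweep could look like:

    import matplotlib.pyplot as plt
    from sklearn.metrics import accuracy_score, f1_score  # or your own metrics

    depths = range(1, 26)                  # depths 1 to 25 inclusive
    scores = {"train_acc": [], "test_acc": [], "train_f1": [], "test_f1": []}
    for d in depths:
        clf = DecisionTreeClassifier(max_depth=d)   # assumed interface
        clf.fit(x_train, y_train)
        for name, X, y in [("train", x_train, y_train), ("test", x_test, y_test)]:
            pred = clf.predict(X)
            scores[name + "_acc"].append(accuracy_score(y, pred))
            scores[name + "_f1"].append(f1_score(y, pred))

    plt.plot(depths, scores["train_acc"], label="train accuracy")
    plt.plot(depths, scores["test_acc"], label="test accuracy")
    plt.xlabel("max depth"); plt.ylabel("accuracy"); plt.legend(); plt.show()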

Part 2 (30 pts): Random Forest (Bagging). In this part we are interested in the random forest, which is a variation of bagging without some of its limitations. Please implement the following steps:

(a)    Implement a random forest with parameters: n_trees, the number of trees in the forest; max_features, the number of candidate features sampled at each split; and max_depth, the maximum depth of the trees in the forest.

Here is how the forest is created: the random forest is a collection of n_trees trees, each with maximum depth max_depth. Each tree is built on a data set of size 2098 (the size of your original training data) sampled with replacement from the training data. You sample X and y together, and you need to preserve the indexes so that (X, y) pairs stay matched. In the process of building a tree of the forest, each time we try to find the best feature f_i to split on, we first sub-sample (without replacement) max_features features from the feature set and then pick the f_i with the highest benefit among those sampled features. Much of this is handled already in your DecisionTreeClassifier, but you need to modify the DecisionTreeClassifier class to handle feature bagging. You will also fill in the missing RandomForestClassifier code in this class. (A short sketch of both sampling steps appears after this list.)

(b)    For max_depth = 7, max_features = 11, and n_trees ∈ [10, 20, ..., 200], plot the train and test accuracy of the forest versus the number of trees n_trees. Please also plot the train and test F1 scores versus the number of trees.

(c)    What effect does adding more trees into a forest have on the train/testing performance? Why?

(d)    Repeat the above experiments for max_depth = 7, n_trees = 50, and max_features ∈ [1, 2, 5, 8, 10, 20, 25, 35, 50]. How does max_features change the train/test accuracy? Why?

(e)    Optional: try to find the best combination of the three parameters using a similar idea.

(f)    With your best result, run 10 trials with different random seeds (for the data/feature sampling you can use the same seed) and report the individual train/test accuracies and F1 scores, as well as the average train/test accuracy and F1 scores across the 10 trials. Overall, how do you think randomness affects the performance?
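To make the two sampling steps in (a) concrete, here is a minimal sketch, assuming NumPy arrays and assumed names (x_train, y_train, max_features, rng), not the starter code's identifiers:

    import numpy as np

    rng = np.random.default_rng(0)      # any seed; see part (f)

    # Bootstrap sample for one tree: draw row indices with replacement so
    # that each (X, y) pair stays aligned.
    n = len(y_train)                    # 2098 for this data
    idx = rng.integers(0, n, size=n)
    X_boot, y_boot = x_train[idx], y_train[idx]

    # Feature bagging: at each split, sub-sample max_features candidate
    # feature indices without replacement, and evaluate the split benefit
    # only on those candidates.
    n_features = x_train.shape[1]       # 51 for this data
    candidates = rng.choice(n_features, size=max_features, replace=False)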

Part 3 (30 pts): AdaBoost (Boosting). For this part we are interested in applying AdaBoost to create yet another ensemble model with decision trees. Considering the AdaBoost algorithm described in the slides, please do the following steps:

(a)    Implement your own AdaBoostClassifier class, using a modified version of the DecisionTreeClassifier class provided in the code for the weak learners. You should implement it in the same format as the other two classes (DecisionTreeClassifier and RandomForestClassifier). Once again, you will need to slightly modify your DecisionTreeClassifier class. Make a new class, DecisionTreeAdaBoost, if you need to (since you'll have to add the weight vector as described below). However, you should preserve the original code as much as possible and simply modify the gain and prediction functions to handle the weights described. This makes grading easier.

(b)    Let the weak learner be a decision tree with depth of only 1 (decision stump). The decision tree should get a weight parameter D which is a vector of size 2098 (training data size). Implement the decision tree with parameter D such that it considers D in its functionality.

(Hint: it changes the probability estimation used for computing the gini-impurity, and also the predictions at the leaves: use majority "weights" instead of majority counts. Basically, your weight vector D dictates your predictions, rather than the raw majority class.)

(c)    Using the decision tree with parameter D implemented above, develop the AdaBoost algorithm as described in the slides (and part (a)), with a parameter L for the number of base classifiers (number of trees). A sketch of the boosting loop appears after this list.

(d)    Report the train and test accuracy for L ∈ [10, 20, ..., 200].

(e)    Plot the train/test accuracy and train/test F1 scores of AdaBoost against the parameter L.
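As a non-authoritative sketch of the boosting loop: the weight-aware stump interface (fit(X, y, sample_weight=...)) is an assumption about how you might modify your class, while the weighted gini and the update rule are standard AdaBoost; note the internal mapping between the 0/1 labels required by the guidelines and the ±1 labels used by the update:

    import numpy as np

    def weighted_gini(y, w):
        """Gini with weighted class probabilities: p_c = (sum of w over class c) / (sum of w)."""
        p1 = w[y == 1].sum() / w.sum()
        return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

    def adaboost_fit(X, y, L):
        """Train L decision stumps; y holds 0/1 labels per the guidelines."""
        n = len(y)
        D = np.full(n, 1.0 / n)              # uniform initial weights
        y_pm = 2 * y - 1                     # map {0,1} -> {-1,+1} for the update
        stumps, alphas = [], []
        for _ in range(L):
            stump = DecisionTreeClassifier(max_depth=1)  # assumed weight-aware stump
            stump.fit(X, y, sample_weight=D)
            pred_pm = 2 * stump.predict(X) - 1
            err = D[pred_pm != y_pm].sum()   # weighted training error
            alpha = 0.5 * np.log((1.0 - err) / (err + 1e-12))
            D = D * np.exp(-alpha * y_pm * pred_pm)      # up-weight mistakes
            D = D / D.sum()                              # renormalize
            stumps.append(stump); alphas.append(alpha)
        return stumps, alphas

    def adaboost_predict(stumps, alphas, X):
        score = sum(a * (2 * s.predict(X) - 1) for a, s in zip(alphas, stumps))
        return (score > 0).astype(int)       # back to 0/1 labels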

Part 4 (10 pts): Class Imbalance Questions. Here, I hope you've noticed that with a 2:1 ratio of positive to negative class (approximately 66% of the data is class 1), accuracy might not be the best thing to report. Basically, we're used to seeing data with 50% for each class, and we know we're doing well if we beat "random guessing", which we can see immediately with accuracy. In this class-imbalanced case, we could predict the positive class every time and have a base accuracy of approximately 66%, which might trick us into thinking our model is doing okay. We've done nothing to handle this imbalance while training, but there are many things that are often done, and there's no "right" answer for every case. What you need to do here is the following:

(a)    Read this article: https://www.svds.com/learning-imbalanced-classes/

(b)    Write a summary of ideas you have for handling the imbalanced classes in our data. There's no "right answer" here, but you should think about things we could use. I've added F1 score as a metric, which is coarse but better than accuracy alone. What else could I have done to see how good the classifier is on the test data? I think this is important because very often in the real world we see much worse class imbalance than this, and we don't want to blindly report metrics like accuracy if we're reporting the wrong ones. Feel free to report ideas from the article, and if you're not familiar with precision/recall, refresh on it via Wikipedia or the slides. (A small sketch of imbalance-aware metrics follows.)
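For instance, precision, recall, and related quantities can be read off the confusion matrix directly; here is a small self-contained sketch (the function name is illustrative), assuming 0/1 NumPy label arrays:

    import numpy as np

    def imbalance_metrics(y_true, y_pred):
        """Metrics that stay informative under class imbalance (1 = Republican)."""
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)                       # true positive rate
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
        specificity = tn / (tn + fp)                  # recall on the negative class
        balanced_acc = 0.5 * (recall + specificity)   # accuracy averaged per class
        return {"precision": precision, "recall": recall,
                "f1": f1, "balanced_accuracy": balanced_acc}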
