Total: 30 points
Ask questions in #ask-your-tutor-luisa.
This exercise will introduce you to tree-based machine learning methods. Use the stubs provided in the file tree-methods-stubs.ipynb on MaMPF as a starting point for this homework.
1 Density Tree and Decision Tree
Complete the code of the classes DensityTree and DecisionTree in the stub, following the explanations given in the lecture and in the code comments. Note in particular that the predicted density should be zero for test points located outside the tree's training bounding box.
The optimal splits of a density tree shall minimize the leave-one-out error of the resulting children. For a given node $l$, the leave-one-out error is defined as
$$\mathrm{looErr}(l) = \frac{N_l}{N\,V_l}\left(\frac{N_l}{N} - 2\,\frac{N_l-1}{N-1}\right),$$
where $N_l$ and $V_l$ are the number of instances in node $l$ and its volume, and $N$ is the total number of instances for the class under consideration. The region spanned by node $l$ is represented by the vectors $m_l$ and $M_l$ of lower and upper bounds respectively, so that $V_l = \prod_{j=1}^{D}\left(M_{lj} - m_{lj}\right)$.
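A minimal sketch of how this error could be computed for one node, assuming the node's instances are stored in a plain NumPy array and that the node's bounding box is the box spanned by its own points (the stub may instead carry the bounds $m_l$, $M_l$ on the node object); the helper name loo_error is hypothetical:

```python
import numpy as np

def loo_error(node_data, N):
    """Leave-one-out error of one node.

    node_data: (N_l, D) array with the instances that fall into the node.
    N:         total number of instances for the class under consideration.
    """
    N_l = node_data.shape[0]
    m_l = node_data.min(axis=0)      # lower bounds of the node's box (assumption)
    M_l = node_data.max(axis=0)      # upper bounds of the node's box (assumption)
    V_l = np.prod(M_l - m_l)         # volume of the box
    if V_l == 0.0:                   # degenerate box, e.g. a single point
        return np.inf
    return N_l / (N * V_l) * (N_l / N - 2.0 * (N_l - 1) / (N - 1))
```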
In case of a decision tree, the optimal splits shall minimize the Gini impurity of the resulting children. The Gini impurity of node $l$ is defined as
$$\mathrm{Gini}(l) = N_l\left(1 - \sum_{k}\left(\frac{N_{lk}}{N_l}\right)^2\right),$$
where $N_l$ is the total number of instances in node $l$, and $N_{lk}$ is the number of instances of class $k$ in $l$.
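A short sketch of this impurity computation, assuming integer class labels and the size-weighted form above; the helper name gini is hypothetical and not part of the stub:

```python
import numpy as np

def gini(labels, n_classes):
    """Size-weighted Gini impurity of a node from its instance labels."""
    N_l = labels.shape[0]
    if N_l == 0:
        return 0.0
    counts = np.bincount(labels, minlength=n_classes)  # N_lk for each class k
    return N_l * (1.0 - np.sum((counts / N_l) ** 2))
```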
To save computation time, only a random subset of size $D_{\mathrm{try}} = \sqrt{D}$ of the features shall be considered when searching for the optimal split in each node. Note that a different subset must be selected in every node. The function numpy.random.permutation() and "advanced indexing" (see https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing) will be helpful here.
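For illustration, a sketch of the per-node feature subsampling, using a random stand-in for the node's data matrix:

```python
import numpy as np

X = np.random.rand(100, 64)        # stand-in for the node's (N_l, D) data matrix
D = X.shape[1]
D_try = int(np.sqrt(D))            # size of the random feature subset

# a fresh permutation must be drawn in every node
feature_indices = np.random.permutation(D)[:D_try]

# advanced indexing restricts the split search to the selected columns
X_subset = X[:, feature_indices]
```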
Candidate thresholds shall be placed in the middle between consecutive feature values,
$$t = \frac{1}{2}\left(X_{[i]j} + X_{[i+1]j}\right),$$
provided that $X_{[i]j}$ and $X_{[i+1]j}$ are different. The brackets in $X_{[i]j}$ denote sorted order w.r.t. feature $j$. If the two feature values are the same, no threshold shall be placed there. Describe in a comment why this is necessary.
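One way to generate such thresholds for a single feature column; candidate_thresholds is a hypothetical helper name, not taken from the stub:

```python
import numpy as np

def candidate_thresholds(feature_values):
    """Midpoints between consecutive sorted values of one feature.

    No threshold is produced where two consecutive values coincide.
    """
    sorted_values = np.sort(feature_values)
    midpoints = 0.5 * (sorted_values[:-1] + sorted_values[1:])
    return midpoints[sorted_values[:-1] != sorted_values[1:]]
```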
2 Evaluation of Density Tree and Decision Tree
Once again, use the digits dataset provided by sklearn. This time, we will not split the data into training and test sets and just measure the training error (otherwise, the dataset would become too small for density estimation). Train a generative classifier using one instance of DensityTree per digit class, and a discriminative classifier using one instance of DecisionTree. You may try different values for the hyperparameter n_min to improve performance. For each method, show the full training error confusion matrix and comment on the results.
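A small sketch of the evaluation scaffolding using only sklearn; here y_pred is a placeholder for the training-set predictions of your DensityTree/DecisionTree classifiers, whose exact interface is defined in the stub:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix

digits = load_digits()
X, y = digits.data, digits.target      # 1797 instances, 64 features, classes 0..9

# Placeholder: replace with the predictions of the generative classifier
# (argmax over the per-class density times the class prior) or of the
# discriminative DecisionTree classifier.
y_pred = y.copy()

# Rows: true digit, columns: predicted digit.
print(confusion_matrix(y, y_pred))
```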
3 Density Forest and Decision Forest
Complete the code of the classes DensityForest and DecisionForest in the stub, following the explanations given in the lecture and in code comments. To make the trees in the forest independent of each other, create a new bootstrap training set for each tree. That is, each new training set has the same size as the original one and is obtained by selecting instances from the original training data uniformly at random with replacement. Functions numpy.random.choice() and “advanced indexing” are useful here.
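A sketch of the bootstrap construction for one tree, with random stand-in data in place of the actual training set:

```python
import numpy as np

X = np.random.rand(200, 64)                 # stand-in for the original training data
y = np.random.randint(0, 10, size=200)      # stand-in labels

N = X.shape[0]
# draw N indices uniformly at random with replacement
bootstrap_indices = np.random.choice(N, size=N, replace=True)

# advanced indexing builds this tree's bootstrap training set
X_bootstrap = X[bootstrap_indices]
y_bootstrap = y[bootstrap_indices]
```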
In case of the DecisionForest, it makes sense to drop the size restriction on split nodes, i.e. to train all leaves to purity, by setting the hyperparameter n_min = 0.
The ensemble prediction shall be the average of the individual tree predictions.
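With the individual tree predictions collected in a list (stand-in values below), the averaging is a single NumPy call:

```python
import numpy as np

# stand-in for the per-tree predictions at one test point,
# e.g. class-probability vectors or density values
tree_predictions = [np.random.rand(10) for _ in range(20)]

forest_prediction = np.mean(tree_predictions, axis=0)
```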