### Decision Trees

• Non-parametric technique for classification & regression
• Simple to understand, interpret & visualize using plot_tree (see the sketch after this list).
• Export trees to Graphviz format using export_graphviz. Install the renderer via conda install python-graphviz.
• Requires little data prep.
• Prediction cost is logarithmic in the number of training points, i.e. log(#datapoints).
• Can handle numerical & categorical data (note: Scikit-Learn implementation does not support categories at this time.)
• Multi-output problem support.
• Can be validated with statistical tests.
• Prone to overfitting with overly complex trees. Pruning, setting a minimum number of samples per leaf, and fixing a maximum tree depth all help.
• Prone to instability due to data variations causing completely different trees to be built. Ensembles help.
• Cannot guarantee globally optimal solutions due to algorithms choosing locally optimal answers at each node. Ensembles help.
• Some concepts, such as XOR, do not lend themselves well to tree techniques.
• Prone to bias when classes are unbalanced. Balancing the dataset before fitting helps.
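
A minimal sketch of the visualization hooks mentioned above (the iris dataset and the max_depth value are assumptions):

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Fit a small classifier on the bundled iris data (assumed dataset)
X, y = load_iris(return_X_y=True)
clf = tree.DecisionTreeClassifier(max_depth=3).fit(X, y)

tree.plot_tree(clf)                                  # matplotlib rendering
dot_data = tree.export_graphviz(clf, out_file=None)  # Graphviz DOT source string
```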

### DT Classifier

• Accepts an array X of shape (#samples, #features) (dense or sparse) of training samples, and an array Y of shape (#samples,) of class labels, as shown below.
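
A minimal fit/predict sketch, adapted from the scikit-learn user guide:

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [1, 1]]  # 2 samples, 2 features
Y = [0, 1]            # one class label per sample
clf = DecisionTreeClassifier()
clf.fit(X, Y)

clf.predict([[2.0, 2.0]])        # -> array([1])
clf.predict_proba([[2.0, 2.0]])  # per-class probabilities
```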

### Graphviz rendering to PDF

```python
import graphviz
from sklearn.tree import export_graphviz

dot_data = export_graphviz(clf, out_file=None)  # clf: a fitted tree
file = graphviz.Source(dot_data)
file.render("example-tree")  # writes example-tree.pdf
```

### DT Regression

• Same principles as DT Classification, except y (labels) are expected to be floating point values.
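
A minimal sketch, again following the user-guide example:

```python
from sklearn.tree import DecisionTreeRegressor

X = [[0, 0], [2, 2]]
y = [0.5, 2.5]          # continuous targets
regr = DecisionTreeRegressor()
regr.fit(X, y)
regr.predict([[1, 1]])  # -> array([0.5])
```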

### Multiple-output Decision Trees

• Supervised learning problem with multiple outputs to predict
• Y is a 2D array of (#samples, #outputs)
• If no correlation between the outputs, the simplest solution is to build $n$ independent models - one per output.
• If correlation is suspected, it's often better to build a single multi-output model by storing $n$ output values in each leaf (instead of one), and by using splitting criteria that compute the average reduction across all $n$ outputs.
• DTC and DTR both support multiple outputs.
• Outputs:
  - predict: a list of $n$ output values
  - predict_proba: a list of $n$ arrays of class probabilities
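
A sketch of those output shapes (the toy two-output targets are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 2)
Y = np.column_stack([(X[:, 0] > 0.5).astype(int),   # output 1
                     (X[:, 1] > 0.5).astype(int)])  # output 2
clf = DecisionTreeClassifier().fit(X, Y)

clf.predict(X[:2]).shape       # (2, 2): one column per output
len(clf.predict_proba(X[:2]))  # 2: one probability array per output
```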

### Example: multi-output regression

• Dataset: a noisy circle whose two coordinates are predicted from a single underlying feature (see the sketch below).
• Note how the tree overfits if max_depth is set too high.
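
A sketch of the setup, loosely following the scikit-learn example (the constants and noise scheme are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(100, 1) - 100, axis=0)  # single underlying feature
y = np.array([np.pi * np.sin(X).ravel(),
              np.pi * np.cos(X).ravel()]).T        # 2 outputs: points on a circle
y[::5, :] += 0.5 - rng.rand(20, 2)                 # add noise to every 5th sample

# Deeper trees start fitting the noise: compare predictions at several depths
for depth in (2, 5, 8):
    regr = DecisionTreeRegressor(max_depth=depth).fit(X, y)
```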

### Complexity

• Binary tree build time: $O(n_{samples}n_{features}\log(n_{samples}))$
• Binary tree query time: $O(\log(n_{samples}))$
• Trees will not always be balanced. Assuming balanced subtrees, the cost of searching each node for the split with the largest entropy reduction is $O(n_{features}n_{samples}\log(n_{samples}))$ per node.
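
As a rough worked example (the sample count is assumed): with $n_{samples} = 10^6$, a balanced binary tree has depth

$\log_2(10^6) = 6\log_2(10) \approx 6 \times 3.32 \approx 20$

so a single query visits only about 20 nodes.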

### DT Algorithms

• ID3 creates a multiway tree, finding for each node the categorical feature that yields the largest information gain for categorical targets. Trees are grown to maximum size, then pruned to improve generalization to unseen data.

• C4.5 succeeded ID3 and removed the categorical variable restriction. It converts trained trees into sets of if-then rules, which are evaluated to determine an application order. Pruning is done by removing a rule's precondition if accuracy improves without it.

• C5.0 succeeds C4.5. It is available under a proprietary license.

• CART is similar to C4.5 & supports numerical (regression) targets. It builds binary trees using the feature & threshold that yields the largest information gain at each node. Scikit-Learn uses an optimized version of CART.

• The quality of a candidate split of a node is found with an impurity or loss function $H()$. How $H()$ is computed depends on the task being solved.

• If a target is a classification outcome, $p_{mk}$ is the proportion of class $k$ observations in node $m$: $p_{mk} = \frac{1}{N_m} \sum_{y \in Q_m} I(y = k)$. Here are the most common criteria for classification-based splits.

• Gini: $H(Q_m) = \sum_k p_{mk} (1 - p_{mk})$
• Entropy: $H(Q_m) = - \sum_k p_{mk} \log(p_{mk})$
• Misclassification: $H(Q_m) = 1 - \max(p_{mk})$
• If the target is a continuous (regression) value, here are the most common criteria for regression-based splits:

• Mean Squared Error (MSE, aka L2 error): \begin{align}\begin{aligned}\bar{y}_m = \frac{1}{N_m} \sum_{y \in Q_m} y\\H(Q_m) = \frac{1}{N_m} \sum_{y \in Q_m} (y - \bar{y}_m)^2\end{aligned}\end{align}
• Poisson deviance: $H(Q_m) = \frac{1}{N_m} \sum_{y \in Q_m} (y \log\frac{y}{\bar{y}_m} - y + \bar{y}_m)$

  - Using criterion="poisson" is recommended if your target is a count or frequency. $y \ge 0$ is required for this criterion. It fits much slower than MSE.


• Mean Absolute Error (MAE): \begin{align}\begin{aligned}median(y)_m = \underset{y \in Q_m}{\mathrm{median}}(y)\\H(Q_m) = \frac{1}{N_m} \sum_{y \in Q_m} |y - median(y)_m|\end{aligned}\end{align}
• MAE also fits much slower than MSE.
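
These criteria map onto the criterion parameter of the estimators; a sketch (the toy class proportions at the end are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification criteria
DecisionTreeClassifier(criterion="gini")     # default
DecisionTreeClassifier(criterion="entropy")

# Regression criteria
DecisionTreeRegressor(criterion="squared_error")   # MSE (default)
DecisionTreeRegressor(criterion="absolute_error")  # MAE, slower fit
DecisionTreeRegressor(criterion="poisson")         # counts/frequencies, y >= 0

# Worked impurities for an assumed toy node: 3 samples of class 0, 1 of class 1
p = np.array([3 / 4, 1 / 4])
gini = np.sum(p * (1 - p))        # 0.375
entropy = -np.sum(p * np.log(p))  # ~0.562 nats
misclass = 1 - p.max()            # 0.25
```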

### Minimal cost-complexity Pruning

• Used to prune trees to avoid overfitting.
• Controlled with the complexity parameter $\alpha \ge 0$, which defines a cost-complexity measure of a tree $T$: $R_\alpha(T) = R(T) + \alpha|\widetilde{T}|$
• $|\widetilde{T}|$ is the number of terminal nodes in $T$.
• $R(T)$ is the total misclassification rate of the terminal nodes (scikit-learn uses the total sample-weighted impurity of the terminal nodes instead).
• Non-terminal nodes with the smallest effective alpha value are pruned.
• The process stops when the tree's minimal effective alpha is greater than ccp_alpha.
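
As a toy worked example (all numbers assumed): a subtree with $|\widetilde{T}| = 3$ terminal nodes, terminal-node impurity $R(T) = 0.04$, and $\alpha = 0.01$ scores

$R_\alpha(T) = R(T) + \alpha|\widetilde{T}| = 0.04 + 0.01 \times 3 = 0.07$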

### Example: Minimal cost-complexity pruning

• Demonstrate the effect of ccp_alpha on tree regularization
• Demonstrate how to choose ccp_alpha with validation scores
• Use cost_complexity_pruning_path to find the effective alphas & corresponding leaf impurities at each step of the pruning process.
• As alpha increases, more of the tree is pruned - which increases total impurity.
• Train a DT using the effective alphas.
• The last value in ccp_alphas will prune the entire tree - leaving it (clfs[-1]) with a single node.
• Show the #nodes & tree depth as alpha increases:
• When using the DT Classifier with ccp_alpha=0 and all other parameters set to their defaults, the tree will overfit.
• As alpha increases & the tree is pruned, the tree generalizes better. ccp_alpha=0.015 should give the maximum testing accuracy.
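
A sketch of that workflow, following the scikit-learn pruning example (the breast-cancer dataset and random_state are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Effective alphas & total leaf impurities at each step of the pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# One tree per effective alpha; clfs[-1] is pruned down to a single node
clfs = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
        for a in ccp_alphas]

# Pick the alpha whose tree scores best on held-out data
test_scores = [clf.score(X_test, y_test) for clf in clfs]
```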