Ensembles (Bagging, Forests)¶

• Averaging methods - build multiple independent estimators, then average their predictions.

Bagging Methods¶

• Bagging methods build multiple "black box" estimators using random subsets of a training dataset, then aggregate the predictions.
• Effective with complex models (i.e., fully developed decision trees).
• Variations usually occur in the random subset selection process:

• Pasting: random subsets of the dataset are drawn as random subsets of the samples (without replacement).
• Bagging: samples are drawn with replacement.
• Random Subspaces: random subsets are drawn as random subsets of the features.
• Random Patches: when base estimators are built on random subsets of both samples and features.
• Bagging estimators are built with the BaggingClassifier / BaggingRegressor meta-estimators, given a user-specified base estimator & subset selection parameters (see the sketch after this list):

• max_samples: controls sample subset size
• max_features: controls feature subset size
• bootstrap: controls whether samples are drawn with/without replacement
• bootstrap_features: controls whether features are drawn with/without replacement
• oob_score=True: When using a subset of samples, the generalization accuracy can be estimated with out-of-bag samples.
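
A minimal sketch of building such an ensemble (the KNeighborsClassifier base estimator, toy dataset, and parameter values are illustrative assumptions, not the only choices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Toy dataset (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each KNN is fit on a random 50% of the samples and 50% of the features
bagging = BaggingClassifier(
    KNeighborsClassifier(),
    n_estimators=10,
    max_samples=0.5,          # sample subset size
    max_features=0.5,         # feature subset size
    bootstrap=True,           # samples drawn with replacement
    bootstrap_features=False,
    oob_score=True,           # estimate generalization accuracy on out-of-bag samples
    random_state=0,
).fit(X, y)

print(bagging.oob_score_)
```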

Example: Single estimator vs Bagging¶

• Illustrates the bias-variance decomposition of the expected mean squared error for a single estimator vs. a bagging ensemble.
• Upper left: predictions (dark red) of a single decision tree.
• Trained on a random dataset (blue dots) from a toy 1D regression.
• Variance corresponds to the width of the beam of predictions (light red).
• The larger the variance, the more sensitive the predictions at a point x are to small changes in the training data.
• Bias corresponds to the difference between average estimator prediction (cyan) and the best possible model (dark blue).
• Observed bias is quite low (cyan & blue curves close together)
• Observed variance is high (red beam is wide)
• Lower left: pointwise decomposition of the expected MSE of a single decision tree.
• Confirms the bias (blue) is low, while variance (green) is large.
• Confirms the noise component is a relatively constant 0.01.
• Upper & lower right: same plots, based on a bagging ensemble of decision trees.

• Total error of the bagging ensemble is lower than the single decision tree - mainly due to the reduced variance.
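
For reference, a sketch of the pointwise decomposition used in the lower-left panel, with $f(x)$ the best possible model, $\hat{y}(x)$ the prediction of an estimator trained on a random dataset, $\sigma^2$ the noise variance, and the expectation taken over random training sets:

$$
\mathbb{E}\left[(y - \hat{y}(x))^2\right] =
\underbrace{\left(f(x) - \mathbb{E}[\hat{y}(x)]\right)^2}_{\text{bias}^2} +
\underbrace{\mathbb{E}\left[\left(\hat{y}(x) - \mathbb{E}[\hat{y}(x)]\right)^2\right]}_{\text{variance}} +
\underbrace{\sigma^2}_{\text{noise}}
$$

Averaging over many bootstrapped trees shrinks the variance term while leaving the bias roughly unchanged, which is why the ensemble's total error is lower.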

Random Forest Classifiers¶

• Sklearn has two averaging algorithms: Random Forest and Extra Trees.
• Forest classifiers accept a training array X (#samples, #features) and a class labels array Y (#samples).
• Random Forest methods can also be used on multi-output problems if Y is expanded to (#samples, #outputs).
• Each tree is built from a sample drawn with replacement ("bootstrap") from the training data. Each node split is found from either all input features, or a random subset of size max_features.
• The randomness helps to decrease estimator variance - individual trees usually have high variance and tend to overfit.
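
A minimal sketch of the fit/predict interface described above (toy data and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# X: (n_samples, n_features), Y: (n_samples,) class labels -- toy data
X, Y = make_classification(n_samples=1000, n_features=4, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, Y)
print(clf.predict([[0.0, 0.0, 0.0, 0.0]]))
```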

Extra Trees Classifiers¶

• Extra Trees, aka "extremely randomized trees", take the randomness one step further.
• Instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature. The best of these randomly-generated thresholds is picked as the splitting rule.
• Usually results in slightly lower variance and slightly higher bias.
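
A sketch comparing the two forest variants on the same toy data via cross-validation (scores will vary with the dataset; parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

for Forest in (RandomForestClassifier, ExtraTreesClassifier):
    clf = Forest(n_estimators=100, random_state=0)
    print(Forest.__name__, cross_val_score(clf, X, y, cv=5).mean())
```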

Parameters¶

• n_estimators (#trees in the forest) and max_features (size of the random feature subset considered at each node split) are the main params.
• More trees is better, but will take longer to compute - and results will eventually plateau.
• A smaller max_features means a greater reduction of variance, but also a greater increase in bias.
• Good starting points:
• max_features=None (always consider all features) for regression
• max_features="sqrt" (use a random subset of size sqrt(n_features)) for classification.
• Always consider using CV to pick param values.
• Default bootstrap sample strategy:
• yes for Random Forests (bootstrap=True)
• no (instead use entire dataset) for Extra Trees (bootstrap=False)
• Model size with default settings: $O(M \cdot N \cdot \log(N))$, where $M$ = #trees, $N$ = #samples
• use min_samples_split, max_leaf_nodes, max_depth and min_samples_leaf to reduce model size if needed.
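
A sketch of picking these parameters with cross-validation, as suggested above (the grid values are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_features": ["sqrt", None],
    "min_samples_leaf": [1, 5],   # larger values also shrink model size
}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```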

Parallel Execution¶

• n_jobs=k enables parallel tree construction & prediction computation, where $k$ is the #jobs to run on $k$ cpu cores. (If n_jobs=-1, all available cores will be used.)
• The speedup will not be linear due to inter-cpu communication latencies.
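
For example (assumes a multi-core machine; data and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Trees are built (and predictions computed) in parallel on all available cores
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
clf.fit(X, y)
```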

Feature Importance¶

• The rank (aka depth) of a feature can signal its importance to prediction accuracy.
• Features used near the top of a tree contribute to the final decision for a larger fraction of the input samples, so the expected fraction of samples a feature contributes to can be used as an importance measure.
• Scikit-Learn combines this with the decrease in impurity from splitting to create a normalized estimate.
• Averaging these estimates across multiple trees (aka "mean decrease in impurity", "MDI") can reduce variance.
• MDI values computed from training data don't necessarily translate to features in held-out datasets. They also favor high-cardinality (many unique values) features.
• Permutation Feature Importance is an alternative approach that doesn't have these flaws.
• The estimates are stored in feature_importances_ - an array of #features with all positive values summing to 1.0.
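
A sketch of reading both importance measures (the toy data and split are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# MDI: computed on training data, one value per feature, sums to 1.0
print(clf.feature_importances_)

# Permutation importance: evaluated on held-out data, avoids the MDI pitfalls
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```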

Feature importance, Extra Trees¶

• Red bars: impurity-based feature importances of the forest, with error bars showing inter-tree variability.
• The plot suggests 3 informative features.
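
A plotting sketch in the same spirit as the figure described above (the dataset and forest here are assumptions, not the original example's):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           random_state=0)
forest = ExtraTreesClassifier(n_estimators=250, random_state=0).fit(X, y)

importances = forest.feature_importances_
# Inter-tree variability: std of the per-tree importances
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
order = np.argsort(importances)[::-1]

plt.bar(range(X.shape[1]), importances[order], yerr=std[order], color="r")
plt.xticks(range(X.shape[1]), order)
plt.ylabel("impurity-based importance")
plt.show()
```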

Random Tree Embedding¶

• Computes an unsupervised data transformation with completely random trees.
• It encodes each data point by the indices of the leaves it ends up in.
• Each leaf index is one-of-K encoded, leading to a high-dimensional, sparse binary coding which can be used as the basis for further learning.
• Given that neighboring data is more likely to reside in the same leaves, the transform effectively does a non-parametric density estimate.
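
A minimal sketch of the transform (the toy dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding

X, _ = make_circles(n_samples=100, factor=0.5, noise=0.05, random_state=0)

# Each sample is encoded by the (one-hot) leaf it lands in within each tree
hasher = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_transformed = hasher.fit_transform(X)

print(X_transformed.shape)  # (100, total number of leaves across the 10 trees)
```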