• Fits a sequence of weak learners on repeatedly modified versions of a dataset.
• Combines the predictions via a weighted majority voting system.
• Uses a weight vector ($w_1, w_2, \ldots, w_N$) over the training samples; each weight is initialized to $1/N$.
• Incorrectly predicted samples have their weights increased for subsequent runs.
• Correctly predicted samples have their weights decreased for subsequent runs.
• Therefore, subsequent learners are forced to concentrate on examples that were previously missed.
• n_estimators controls the number of weak learners.
• learning_rate controls the contribution of each weak learner to the final combination.
• By default, the weak learners are decision stumps (depth-1 decision trees).
• base_estimator sets the type of weak learner.
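A minimal usage sketch of these parameters, assuming a scikit-learn release where the parameter is still named base_estimator (newer releases rename it to estimator); the dataset and parameter values are illustrative only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 100 decision stumps (depth-1 trees), each contribution shrunk by 0.5
clf = AdaBoostClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    learning_rate=0.5,
)
print(cross_val_score(clf, X, y, cv=5).mean())
```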

### Example: Discrete (SAMME) vs Real (SAMME.R) AdaBoost comparison

• Y is a nonlinear function of 10 inputs.

• Compares the accuracy of SAMME (which boosts on discrete class predictions) with SAMME.R (which uses predicted class probabilities to update the additive model)
• Dataset: samples drawn from a 10-dimensional standard normal distribution
• 3 classes separated by nested, concentric 10-D spheres
• Roughly equal numbers of samples in each class
• Expectation: SAMME.R should converge faster, with lower test error
• Algorithm error on the left; classification error in the middle; boost weights on the right.

• Plots decision boundaries and decision scores
• Scores>0 are "B", otherwise "A"
• Score magnitude indicates confidence
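A rough sketch of the decision-score idea; make_gaussian_quantiles stands in for the example's two-class dataset (an assumption, as are the parameter values):

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier

# Two overlapping classes, labelled 0 ("A") and 1 ("B")
X, y = make_gaussian_quantiles(n_samples=500, n_features=2, n_classes=2,
                               random_state=0)
clf = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X, y)

# decision_function returns a signed score: >0 -> "B", otherwise "A";
# the magnitude reflects the ensemble's confidence.
scores = clf.decision_function(X)
labels = np.where(scores > 0, "B", "A")
print(labels[:10], np.abs(scores[:3]))
```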

### Example: Decision Tree Regression with AdaBoost

• 1D sinusoidal data with Gaussian noise
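A minimal sketch of that setup (the exact target function and parameter values are assumptions):

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

# 1D sinusoidal target with Gaussian noise
X = np.linspace(0, 6, 200)[:, np.newaxis]
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=X.shape[0])

# Boost 300 depth-4 regression trees
regr = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                         n_estimators=300, random_state=0)
regr.fit(X, y)
print(regr.predict(X[:5]))
```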

### Gradient Boosting

• Generalizes boosting to arbitrary differentiable loss functions.
• Scikit-Learn also provides histogram-based implementations inspired by LightGBM.
• The new histogram-based classifier & regressor are much faster than the standard Gradient Boosting classifier & regressor when sample sizes exceed 10K.

• Supports binary & multiclass classification
• Number of weak learners (regression trees) is controlled with n_estimators. Tree size is controlled with max_depth and max_leaf_nodes.
• learning_rate controls overfitting via shrinkage; it typically takes a value in (0.0, 1.0].
• Classification with >2 classes requires induction of n_classes trees at each iteration. You should use HistGradientBoostingClassifier for datasets with large numbers of classes.

• Supports different loss functions via loss. The default is least squares (ls).

• Applies LS loss & 500 base learners to the diabetes dataset.
• Plots training & test error vs iteration count.
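A sketch of that setup, assuming an older scikit-learn release where the least-squares loss is still spelled 'ls'; staged_predict yields the per-iteration predictions such a plot is built from:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

# 500 shallow trees, least-squares loss, mild shrinkage
reg = GradientBoostingRegressor(n_estimators=500, max_depth=4,
                                learning_rate=0.01, loss="ls")
reg.fit(X_train, y_train)

# Test error after each boosting iteration (the per-iteration training
# loss is available in reg.train_score_)
test_errors = [mean_squared_error(y_test, y_pred)
               for y_pred in reg.staged_predict(X_test)]
print(min(test_errors))
```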

### Tree Size

• The size of the regression tree base learners defines the level of variable interactions that can be captured by gradient boosting. In general, a tree of depth h captures interactions of order h.

• max_depth=h allows binary trees of depth h. They will have (at most) 2^h leaf nodes and 2^h-1 split nodes.

• max_leaf_nodes controls the number of leaf nodes. In this case, trees will grow using best-first search - nodes with the highest improvement in impurity will be expanded first.

• max_leaf_nodes=k gives comparable results to max_depth=k-1 but is significantly faster to train, at the expense of a slightly higher training error.
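A small comparison sketch on synthetic data (timings are only indicative):

```python
from time import perf_counter

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=5000, n_features=20, random_state=0)

# max_leaf_nodes=k grows trees best-first and behaves roughly like
# max_depth=k-1, but typically trains faster.
for params in ({"max_depth": 3}, {"max_depth": None, "max_leaf_nodes": 4}):
    start = perf_counter()
    reg = GradientBoostingRegressor(n_estimators=100, **params).fit(X, y)
    print(params, f"train R^2={reg.score(X, y):.3f}",
          f"time={perf_counter() - start:.2f}s")
```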

• TODO

### Loss Functions

• Regression:
• least squares (ls): the default choice. Initial model given by the mean of the target values.
• least absolute deviation (lad): Initial model given by the median of the target values.
• huber (huber): Combines LS & LAD; uses alpha to control outlier sensitivity.
• quantile (quantile): For quantile regression, e.g. to build prediction intervals. Uses 0<alpha<1 to specify the quantile.
• Classification:
• binomial deviance (deviance): Uses a negative binomial log-likelihood loss function for binary classification.
• multinomial deviance (deviance): Uses a negative multinomial log-likelihood loss function for multiclass classification. Builds n_classes regression trees per iteration, which makes GBRT inefficient for large numbers of classes.
• exponential loss (exponential): Also used by AdaBoost. Less robust to mislabeled samples than deviance.
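As an illustration of the loss parameter, a sketch building a rough 90% prediction interval from two quantile models (the dataset and quantile choices are assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=500)

# Two quantile models bracket a rough 90% prediction interval;
# alpha selects the quantile to be estimated.
lower = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, y)

X_new = np.array([[2.0], [5.0]])
print(np.c_[lower.predict(X_new), upper.predict(X_new)])
```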

### Shrinkage via Learning Rate

• Regularization strategy: scales the contribution of each weak learner by $\nu$ (the learning rate): $F_m(x) = F_{m-1}(x) + \nu h_m(x)$.

• The learning rate (learning_rate) scales the step length of the gradient descent procedure. It interacts strongly with n_estimators (the number of weak learners).

• Smaller learning rates require larger numbers of weak learners to maintain a constant training error.

• Evidence indicates that smaller learning rates yield better test error. A recommended starting point is a learning rate below 0.1.
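A sketch of the learning_rate / n_estimators trade-off on a synthetic dataset (the value pairs are illustrative):

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_hastie_10_2(n_samples=4000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A smaller learning rate needs more estimators to reach a similar
# training error, but often generalizes better.
for lr, n in [(1.0, 100), (0.1, 1000)]:
    clf = GradientBoostingClassifier(learning_rate=lr, n_estimators=n,
                                     random_state=0).fit(X_train, y_train)
    print(f"learning_rate={lr} n_estimators={n} "
          f"test accuracy={clf.score(X_test, y_test):.3f}")
```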

### Subsampling

• Stochastic gradient boosting (SGB) combines GB with bootstrap averaging (bagging). At each iteration, a base classifier is trained on a subsample of the training data (a fraction drawn without replacement; 0.5 is typical).

• Below example:

• shows how shrinkage outperforms no-shrinkage.
• Subsampling + shrinkage even better.
• Subsampling without shrinkage = poor.
• Also: subsampling features will reduce variance. The number of subsampled features is controlled with max_features. Small values will significantly reduce runtimes.

• SGB allows out-of-bag (OOB) estimates by measuring the improvement in deviance on the examples not included in the subsample. The improvements are tracked in oob_improvement_. This can be used, for example, for model selection (e.g., choosing the number of iterations).

• OOB estimates are usually very pessimistic. Use cross-validation instead - resort to OOB only if CV is too time-consuming.
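A sketch combining subsampling with the OOB improvements on the same kind of synthetic data as above; the argmax of the cumulative OOB improvement is a common (if pessimistic) guess at a good iteration count:

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(n_samples=4000, random_state=0)

# subsample < 1.0 turns this into stochastic gradient boosting and enables
# out-of-bag estimates; max_features subsamples features at each split.
clf = GradientBoostingClassifier(n_estimators=200, subsample=0.5,
                                 max_features="sqrt", random_state=0)
clf.fit(X, y)

# oob_improvement_[i] is the OOB improvement in loss at iteration i
cum_oob = np.cumsum(clf.oob_improvement_)
print("suggested number of iterations:", int(np.argmax(cum_oob)) + 1)
```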

### Interpretation with Feature Importance

• Individual decision trees can be evaluated by visualizing the tree structure. Gradient boosting models include hundreds of regression trees & cannot be easily interpreted by visual inspection. Fortunately, a number of techniques have been proposed to summarize and interpret gradient boosting models.

• Features usually do not contribute equally to the target response; in many situations the majority of features are irrelevant.

• Individual decision trees perform feature selection by selecting appropriate split points. This can be used to measure the importance of each feature - the more often a feature is used in the split points, the more important that feature is.

• This notion can be extended to an ensemble by averaging the impurity-based feature importance of each tree.

• The feature importance scores of a fitted GB model can be accessed via feature_importances_.

• Note: this evaluation is impurity-based (computed from the training-time splits). It is distinct from permutation_importance, which is computed by permuting feature values.
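A short sketch contrasting the two importance measures (dataset chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Impurity-based importances, averaged over all trees in the ensemble
print(clf.feature_importances_[:5])

# Permutation importance on held-out data, for comparison
result = permutation_importance(clf, X_test, y_test, n_repeats=5,
                                random_state=0)
print(result.importances_mean[:5])
```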

• A related example illustrates the effectiveness of different regularization strategies for Gradient Boosting on an artificial dataset.
• Loss function: binomial deviance

### Histogram Gradient Boosting

• Introduced in Scikit-Learn v0.21; much faster than standard Gradient Boosting for datasets larger than 10K samples. Experimental release: the API is not 100% compatible with standard Gradient Boosting.
• Both classification and regression are supported.

• Number of bins is controlled with max_bins. Fewer bins = more regularization. General rule: use as many bins as possible (see the sketch below).

• Loss function regularization is controlled with l2_regularization.

• Loss function options:

• 'least_squares' & 'least_absolute_deviation' (the latter is less sensitive to outliers)
• 'poisson' - suited for counts & frequencies
• 'binary_crossentropy' - for binary classification
• 'categorical_crossentropy' - for multiclass classification
• 'auto': the default; the loss function is chosen based on the $y$ passed to fit.
• Early stopping enabled by default if #samples>10000.
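A minimal sketch of these knobs using the experimental-era API described here (hence the enable_hist_gradient_boosting import and loss='auto'; later releases drop both). The dataset and parameter values are illustrative:

```python
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

clf = HistGradientBoostingClassifier(
    max_bins=255,            # fewer bins = more regularization
    l2_regularization=1.0,   # penalizes leaf values
    loss="auto",             # chosen from the y passed to fit
    # early stopping is on by default with >10,000 samples;
    # these parameters control it explicitly
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=10,
)
clf.fit(X, y)
print("boosting iterations actually run:", clf.n_iter_)
```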

• Controlled by early_stopping, scoring, validation_fraction, n_iter_no_change and tol.
• Built-in missing value support - no need for standalone imputation.

• The tree growth algorithm determines whether a sample with a missing value should "go left" or "go right", based on the potential gain.
• When the missingness pattern is predictive, splits can be made on whether the feature value is missing or not (see the sketch after this list).
• Histogram Gradient Boosting supports sample weights.
• Histogram Gradient Boosting supports categorical features.
• Native category support is often better than one-hot encoding, because one-hot requires more tree depth to achieve equivalent splits.
• Native support is also usually preferable to ordinal encoding, since the order of categories is arbitrary and should not matter.
• Done by passing a boolean mask, or a list of integer indices, to categorical_features to indicate which are categories:
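A sketch of both features on made-up data: NaNs in a numeric column and integer category codes in a second column marked via categorical_features:

```python
import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)

# Column 0: numeric with ~20% missing values; column 1: category codes 0..2
x_num = rng.normal(size=200)
x_num[rng.rand(200) < 0.2] = np.nan
x_cat = rng.randint(0, 3, size=200)
X = np.c_[x_num, x_cat]
y = np.where(np.isnan(x_num), 1.0, x_num) + x_cat

# NaNs are routed left or right during tree growth based on potential gain,
# so no separate imputation step is needed. categorical_features marks
# column 1 as a category via a boolean mask (integer indices also work).
reg = HistGradientBoostingRegressor(categorical_features=[False, True])
reg.fit(X, y)
print(reg.predict([[np.nan, 2], [0.5, 0]]))
```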

### Example: Histogram Gradient Boosting - Category Feature Support

• Compares training times & prediction performance of a regressor with different encoding strategies:
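A rough sketch of such a comparison on made-up data, assuming an older scikit-learn where OneHotEncoder still accepts sparse=False (newer releases spell it sparse_output=False):

```python
from time import perf_counter

import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.compose import make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
X = np.c_[rng.normal(size=5000), rng.randint(0, 20, size=5000)]
y = rng.normal(size=5000) + X[:, 1]

# Native categorical handling of column 1
native = HistGradientBoostingRegressor(categorical_features=[False, True])

# One-hot encoding of column 1; the numeric column is passed through
one_hot = make_pipeline(
    make_column_transformer((OneHotEncoder(sparse=False), [1]),
                            remainder="passthrough"),
    HistGradientBoostingRegressor(),
)

for name, model in [("native", native), ("one-hot", one_hot)]:
    start = perf_counter()
    model.fit(X, y)
    print(f"{name}: fit time {perf_counter() - start:.2f}s")
```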

### Histogram Gradient Boosting - Monotonic Constraints

• In some situations, you may already know if a given feature has a positive (or negative) effect on the target value. (For example, credit scores.) Monotonic constraints allow you to add such prior knowledge into the model.
• Definitions (where $F$ is a two-featured predictor):
• Positive MC: $x_1 \leq x_1' \implies F(x_1, x_2) \leq F(x_1', x_2)$
• Negative MC: $x_1 \leq x_1' \implies F(x_1, x_2) \geq F(x_1', x_2)$
• monotonic_cst controls the constraint. 0 = no constraint; -1 = negative, +1 = positive.
• Supported for regression and binary classification (not multiclass); a short sketch follows.
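A minimal sketch on synthetic data, constraining predictions to be non-decreasing in the first feature and non-increasing in the second:

```python
import numpy as np
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 2))
# Target increases with feature 0 and decreases with feature 1 (plus noise)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=1000)

# +1: predictions must be non-decreasing in feature 0
# -1: predictions must be non-increasing in feature 1
#  0 would mean no constraint
gbdt = HistGradientBoostingRegressor(monotonic_cst=[1, -1]).fit(X, y)

grid = np.c_[np.linspace(-2, 2, 5), np.zeros(5)]
print(gbdt.predict(grid))  # non-decreasing along feature 0
```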