### Probability Calibration¶

• Classification often requires predicting not only a class label but also a confidence that the label is correct.
• Some classifiers can do this via predict_proba, but not all.
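A minimal sketch of this distinction, using a toy dataset: LogisticRegression exposes predict_proba, while LinearSVC only offers decision_function scores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, random_state=0)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X[:1])            # shape (1, 2); each row sums to 1

svc = LinearSVC(max_iter=10_000).fit(X, y)
has_proba = hasattr(svc, "predict_proba")   # False: only decision_function
```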

### Example: Classifier Confidence Comparison¶

• A well-calibrated classifier should, for example, be correct on ~80% of the samples to which it assigns a predict_proba value of ~0.8.
• Logistic Regression does this - it directly optimizes log-loss.
• Gaussian Naive Bayes pushes probabilities towards 0 or 1, because it assumes features are independent per class. (Not the case in this example.)
• Random Forest, like most bagging methods, has difficulty making predictions near 0 or 1. Variance in the base estimators biases predictions that should be near 0 or 1 away from those extremes. Random Forests are especially prone to this, because base-level trees have relatively high variance. So their calibration curves typically take a sigmoid shape.
• Support Vector Classification also takes on a sigmoid shape, which is typical for methods that concentrate on samples close to the decision boundary (the support vectors).
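The comparison above can be sketched with sklearn.calibration.calibration_curve, which bins predictions and reports the actual fraction of positives per bin (the dataset and sizes here are made up for illustration):

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

curves = {}
for clf in (LogisticRegression(), GaussianNB()):
    prob = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    # frac_pos[i]: actual fraction of positives in bin i;
    # mean_pred[i]: mean predicted probability in bin i.
    # A well-calibrated model has frac_pos ~= mean_pred in every bin.
    frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=10)
    curves[type(clf).__name__] = (frac_pos, mean_pred)
```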

### Calibration¶

• Calibration is defined as fitting a regressor (aka "calibrator") that maps a classifier output (via decision_function or predict_proba) to a calibrated probability in [0,1].
• The calibrator tries to predict $p(y_i = 1 | f_i)$.
• Do not fit the calibrator with the same data used for classifier fit - this will introduce bias.
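A sketch of the idea (not scikit-learn's internal code): fit the classifier on one split and the calibrator on a different split, so the calibrator never sees biased scores.

```python
from sklearn.datasets import make_classification
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, random_state=0)
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, random_state=0)

clf = LinearSVC(max_iter=10_000).fit(X_fit, y_fit)  # has no predict_proba
f_cal = clf.decision_function(X_cal)                # uncalibrated outputs f_i

# Calibrator: a non-decreasing map from f_i to an estimate of p(y_i = 1 | f_i);
# with labels in {0, 1}, its predictions stay in [0, 1].
calibrator = IsotonicRegression(out_of_bounds="clip").fit(f_cal, y_cal)
probs = calibrator.predict(f_cal)
```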

### Cross-Validation¶

• Use this method to ensure the calibrator is always fitted with unbiased data. It splits data into k pairs of (train_set, test_set).

• ensemble=True (default) tells the calibrator to do the following steps independently on each train-test pair:

• Clone the base_estimator and train it on the training subset.
• Predict labels on the test subset.
• Use these predictions and a regressor (sigmoid or isotonic - see below) to fit a calibrator.
• This returns an ensemble of $k$ (classifier, calibrator) pairs.
• Each pair is available in calibrated_classifiers_.
• predict_proba for the main instance is the average of the $k$ estimators in the list.
• predict returns the class with the highest probability.
• ensemble=False tells the calibrator to find unbiased predictions for the entire dataset via cross_val_predict.

• calibrated_classifiers_ contains only one (classifier, calibrator) pair - the classifier is the base_estimator trained on all data.
• predict_proba is the predicted probabilities from the single pair.
• ensemble=True should return slightly better accuracy, as it benefits from typical ensemble effects. ensemble=False is better when computation time is a concern.
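The two ensemble settings can be sketched with CalibratedClassifierCV on a toy dataset:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=0)

# ensemble=True: one (classifier, calibrator) pair per CV fold.
cal_ens = CalibratedClassifierCV(GaussianNB(), cv=5, ensemble=True).fit(X, y)
n_ens = len(cal_ens.calibrated_classifiers_)   # 5 pairs

# ensemble=False: a single pair, calibrated via cross_val_predict outputs.
cal_one = CalibratedClassifierCV(GaussianNB(), cv=5, ensemble=False).fit(X, y)
n_one = len(cal_one.calibrated_classifiers_)   # 1 pair
```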

• You can use cv="prefit" if a prefitted classifier is already available.
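A sketch of the prefit workflow, assuming the classifier was fitted elsewhere (note that recent scikit-learn releases deprecate cv="prefit" in favor of wrapping the estimator with FrozenEstimator):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, random_state=0)
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, random_state=0)

base = LinearSVC(max_iter=10_000).fit(X_fit, y_fit)  # already fitted

# Only the calibrator is fitted here, on data the classifier has never seen.
calibrated = CalibratedClassifierCV(base, cv="prefit").fit(X_cal, y_cal)
probs = calibrated.predict_proba(X_cal)
```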

### Metrics¶

• The Brier score loss (brier_score_loss) can be used to assess calibration performance.
• Brier scores are a combination of calibration and refinement losses.
• Refinement losses can change independently of calibration losses - so lower Brier scores do not necessarily indicate better calibration.
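A minimal sketch of the metric on hand-made labels and probabilities (lower is better; 0 is perfect):

```python
from sklearn.metrics import brier_score_loss

y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.3]

# Mean squared difference between predicted probability and outcome:
# ((0.1)^2 + (0.1)^2 + (0.2)^2 + (0.3)^2) / 4 = 0.0375
score = brier_score_loss(y_true, y_prob)  # -> 0.0375
```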

### Regressors¶

• The Sigmoid regressor is based on this logistic model: $p(y_i = 1 | f_i) = \frac{1}{1 + \exp(A f_i + B)}$, where $y_i$ is the true label of sample $i$ and $f_i$ is the uncalibrated classifier output for that sample. $A$ and $B$ are found via maximum likelihood.
• The sigmoid method assumes calibration curves can be corrected by applying a sigmoid function to raw predictions.
• This method is most effective when the uncalibrated model is under-confident, and has similar errors for both high & low outputs.

• The Isotonic regressor fits a non-decreasing function by minimizing $\sum_{i=1}^{n} (y_i - \hat{f}_i)^2$, where $\hat{f}_i$ is the calibrated prediction for sample $i$.

• The Isotonic method is more general-purpose than sigmoids - the only restriction is that the mapping function must be monotonically increasing. However, it is more prone to overfitting on small datasets.
• This method can be more effective when there is enough data (>1K samples) to avoid overfitting.
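Both regressors are selected via the method parameter of CalibratedClassifierCV; a sketch on a toy dataset large enough (>1K samples) for isotonic to be viable:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, random_state=0)

sig = CalibratedClassifierCV(GaussianNB(), method="sigmoid").fit(X, y)
iso = CalibratedClassifierCV(GaussianNB(), method="isotonic").fit(X, y)

p_sig = sig.predict_proba(X[:3])  # shape (3, 2); rows sum to 1
p_iso = iso.predict_proba(X[:3])
```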

### Multiclass support¶

• Both regressors only support 1D targets (binary classification). For multiclass problems, each class is calibrated separately in a one-vs-rest fashion and the resulting probabilities are normalized, provided the base_estimator supports multiclass classification.
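A sketch of the multiclass case on a made-up 3-class dataset: the calibrated predict_proba returns one column per class, with rows normalized to sum to 1.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(
    n_samples=600, n_classes=3, n_informative=6, random_state=0
)

cal = CalibratedClassifierCV(GaussianNB()).fit(X, y)
proba = cal.predict_proba(X[:5])  # shape (5, 3); each row sums to 1
```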