Many scikit-learn classifiers expose predict_proba, but not all of them do (LinearSVC, for example, only provides decision_function). A well-calibrated classifier is one whose predict_proba output can be read directly as a confidence level: among the samples given a predict_proba value of ~0.8, roughly 80% should actually belong to the positive class (~80% of labels predicted correctly). The snippet below compares how well four classifiers are calibrated on a synthetic dataset:

import numpy as np
np.random.seed(0)
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB as GNB
from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.svm import LinearSVC as LSVC
from sklearn.calibration import calibration_curve as CC
X, y = datasets.make_classification(n_samples=100000, n_features=20,
                                    n_informative=2, n_redundant=2)
train_samples = 100
X_train = X[:train_samples]; X_test = X[train_samples:]
y_train = y[:train_samples]; y_test = y[train_samples:]
plt.figure(figsize=(10, 10))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))
ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
for clf, name in [(LR(), 'Logistic'),
                  (GNB(), 'Naive Bayes'),
                  (LSVC(C=1.0), 'Support Vector Classification'),
                  (RFC(), 'Random Forest')]:
    clf.fit(X_train, y_train)
    if hasattr(clf, "predict_proba"):
        prob_pos = clf.predict_proba(X_test)[:, 1]
    else:  # no predict_proba: rescale the decision function to [0, 1]
        prob_pos = clf.decision_function(X_test)
        prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
    fraction_of_positives, mean_predicted_value = CC(y_test, prob_pos, n_bins=10)
    ax1.plot(mean_predicted_value, fraction_of_positives, "s-", label=name)
    ax2.hist(prob_pos, range=(0, 1), bins=10, label=name,
             histtype="step", lw=2)
ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('Calibration plots (reliability curve)')
ax2.set_xlabel("Mean predicted value")
ax2.set_ylabel("Count")
ax2.legend(loc="upper center", ncol=2)
plt.tight_layout()
plt.show()
Calibrating a classifier means fitting a regressor (the calibrator) that maps the classifier's output (from decision_function or predict_proba) to a calibrated probability in [0, 1]. CalibratedClassifierCV uses cross-validation to ensure the calibrator is always fitted with unbiased data, i.e. on predictions for samples the classifier was not trained on: it splits the data into $k$ pairs of (train_set, test_set).
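As a rough sketch of that workflow (not part of the original example; it reuses X_train, y_train and X_test from the snippet above, and passes the wrapped estimator positionally because its parameter name differs across scikit-learn versions):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# LinearSVC exposes only decision_function; the wrapper adds a calibrated predict_proba.
base_clf = LinearSVC(C=1.0)
calibrated_clf = CalibratedClassifierCV(base_clf, method="sigmoid", cv=3)
calibrated_clf.fit(X_train, y_train)           # builds 3 (train_set, test_set) pairs internally
probas = calibrated_clf.predict_proba(X_test)  # calibrated probabilities in [0, 1]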
ensemble=True (default) tells the calibrator to do the following steps independently on each train-test pair:

- Clone the base_estimator and train it on the training subset.
- Fit a calibrator on the clone's predictions for the test subset.

This yields $k$ (classifier, calibrator) pairs, exposed in the calibrated_classifiers_ attribute. predict_proba for the main instance is the average of the $k$ estimators in the list, and predict returns the class with the highest probability.

ensemble=False tells the calibrator to find unbiased predictions for the entire dataset via cross_val_predict and to fit a single calibrator on them. Here calibrated_classifiers_ contains only one (classifier, calibrator) pair - the classifier is the base_estimator trained on all data - and predict_proba is simply the predicted probabilities from that single pair.

ensemble=True should return slightly better accuracy, as it benefits from typical ensemble effects; ensemble=False is better when computation time is a concern.
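A minimal sketch of the difference (again reusing X_train and y_train from above; the particular classifier and cv value are only illustrative):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

ens = CalibratedClassifierCV(LogisticRegression(), cv=5, ensemble=True)
ens.fit(X_train, y_train)
print(len(ens.calibrated_classifiers_))      # 5 pairs; predict_proba averages their outputs

single = CalibratedClassifierCV(LogisticRegression(), cv=5, ensemble=False)
single.fit(X_train, y_train)
print(len(single.calibrated_classifiers_))   # 1 pair; its calibrator was fit on cross_val_predict outputs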
You can use cv="prefit" if a prefitted classifier is already available; in that case the data passed to fit is used only to fit the calibrator.
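A sketch of that path (the 50/50 split below is arbitrary; the point is that the calibration data must be disjoint from the data the classifier was trained on):

from sklearn.model_selection import train_test_split
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

X_fit, X_calib, y_fit, y_calib = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0)

prefit_clf = LinearSVC(C=1.0).fit(X_fit, y_fit)     # classifier trained beforehand
calibrated = CalibratedClassifierCV(prefit_clf, cv="prefit", method="sigmoid")
calibrated.fit(X_calib, y_calib)                    # only the calibrator is fitted here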
The sigmoid method (method="sigmoid") is most effective when the uncalibrated model is under-confident and has similar calibration errors for both high and low outputs.
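For reference (the formula itself is not given above), the sigmoid calibrator is Platt scaling: it fits $p(y_i = 1 \mid f_i) = \frac{1}{1 + \exp(A f_i + B)}$, where $f_i$ is the uncalibrated score for sample $i$ and the real-valued parameters $A$ and $B$ are learned on the calibration data.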
The isotonic method (method="isotonic") fits a regressor whose output is a non-decreasing function of the classifier's score. It minimizes $\sum_{i=1}^{n} (y_i - \hat{f}_i)^2$, where $y_i$ is the true label of sample $i$ and $\hat{f}_i$ is the calibrated probability it outputs for sample $i$.
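To make the non-decreasing constraint concrete, here is a toy sketch with IsotonicRegression, the regressor behind method="isotonic" (the scores and labels are made up):

import numpy as np
from sklearn.isotonic import IsotonicRegression

scores = np.array([0.10, 0.40, 0.35, 0.80, 0.90])  # uncalibrated scores (toy values)
labels = np.array([0, 0, 1, 1, 1])                 # true labels y_i

iso = IsotonicRegression(out_of_bounds="clip")
calibrated = iso.fit_transform(scores, labels)     # fitted values, non-decreasing in the score
print(calibrated)  # [0.  0.5 0.5 1.  1. ] - the out-of-order middle pair is pooled to 0.5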
base_estimator supports it.