- Classification often requires both predicting class labels **and** estimating the confidence that a predicted label is correct.
- Some classifiers can do this via `predict_proba`, but not all.

- The `predict_proba` output of a well-calibrated classifier can be read directly as a confidence level. For example, among the samples given a `predict_proba` value of ~0.8, ~80% should actually belong to the positive class.
  - **Logistic Regression** does this because it directly optimizes log-loss.
  - **Gaussian Naive Bayes** pushes probabilities towards 0 or 1, because it assumes features are independent per class. (Not the case in this example, which includes redundant features.)
  - **Random Forest**, like most bagging methods, has difficulty making predictions near 0 or 1. Variance among the base models biases predictions that *should be near 0 or 1* away from those extremes. Random Forests are especially prone to this because base-level trees have relatively high variance, so their calibration curves typically take a sigmoid shape.
  - **Support Vector Classification** also takes on a sigmoid shape, which is typical for methods that concentrate on samples close to the decision boundary (the support vectors).
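To make the "~0.8 means ~80% positive" reading concrete, here is a minimal sketch on simulated data. The scores and labels are hypothetical: the labels are drawn so that the scores are perfectly calibrated by construction, which no real classifier guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, perfectly calibrated scores: each label is a Bernoulli draw
# whose success probability equals the predicted score.
scores = rng.uniform(0.0, 1.0, size=200_000)
labels = rng.uniform(0.0, 1.0, size=scores.size) < scores

# Among the samples scored close to 0.8, about 80% should be positive.
near_08 = (scores > 0.75) & (scores < 0.85)
fraction_positive = labels[near_08].mean()
print(f"fraction of positives near 0.8: {fraction_positive:.3f}")
```

A calibration curve repeats this check for every score bin rather than for one bin only.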

In [1]:

```
import numpy as np
np.random.seed(0)
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB as GNB
from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.svm import LinearSVC as LSVC
from sklearn.calibration import calibration_curve as CC
```

In [2]:

```
X, y = datasets.make_classification(n_samples=100000, n_features=20,
                                    n_informative=2, n_redundant=2)
train_samples = 100
X_train = X[:train_samples]; X_test = X[train_samples:]
y_train = y[:train_samples]; y_test = y[train_samples:]
```

In [6]:

```
plt.figure(figsize=(10, 10))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))
ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
for clf, name in [(LR(), 'Logistic'),
                  (GNB(), 'Naive Bayes'),
                  (LSVC(C=1.0), 'Support Vector Classification'),
                  (RFC(), 'Random Forest')]:
    clf.fit(X_train, y_train)
    if hasattr(clf, "predict_proba"):
        prob_pos = clf.predict_proba(X_test)[:, 1]
    else:  # no predict_proba: min-max scale the decision function into [0, 1]
        prob_pos = clf.decision_function(X_test)
        prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
    # Per-bin fraction of positives vs. mean predicted value
    fraction_of_positives, mean_predicted_value = \
        CC(y_test, prob_pos, n_bins=10)
    ax1.plot(mean_predicted_value,
             fraction_of_positives, "s-",
             label="%s" % (name, ))
    ax2.hist(prob_pos, range=(0, 1), bins=10, label=name,
             histtype="step", lw=2)
ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('Calibration plots (reliability curve)')
ax2.set_xlabel("Mean predicted value")
ax2.set_ylabel("Count")
ax2.legend(loc="upper center", ncol=2)
plt.tight_layout()
plt.show()
```

- Calibration is defined as fitting a regressor (aka "calibrator") that maps a classifier output (via `decision_function` or `predict_proba`) to a calibrated probability in [0, 1].
- The calibrator tries to predict $p(y_i = 1 | f_i)$, where $f_i$ is the classifier output for sample $i$.
- Do not fit the calibrator with the same data used to fit the classifier; this will introduce bias.
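The rule about separate data can be sketched by hand. This is an illustration under stated assumptions, not scikit-learn's internal procedure: the synthetic data, the 50/50 split, and the use of a plain `LogisticRegression` on the 1D scores as the calibrator are all choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=20, n_informative=2,
                           n_redundant=2, random_state=0)

# Two disjoint parts: one fits the classifier, the other fits the calibrator.
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5,
                                              random_state=0)

clf = LinearSVC(C=1.0).fit(X_fit, y_fit)

# f_i: uncalibrated scores on samples the classifier has not seen.
f_cal = clf.decision_function(X_cal).reshape(-1, 1)

# Stand-in calibrator (assumption): a 1D logistic regression on the scores.
calibrator = LogisticRegression().fit(f_cal, y_cal)

# Calibrated probability of the positive class, p(y = 1 | f).
p_cal = calibrator.predict_proba(f_cal)[:, 1]
print(p_cal[:5])
```

Fitting the calibrator on `X_fit` instead would let it see scores the classifier produced for its own training samples, which is the bias the note above warns about.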

Use `CalibratedClassifierCV` to ensure the calibrator is always fitted with unbiased data. It splits data into $k$ pairs of `(train_set, test_set)`.

`ensemble=True` (default) tells the calibrator to do the following steps independently on each train-test pair:

- Clone the `base_estimator` and train it on the training subset.
- Compute its predictions on the test subset.
- Use the predictions and a *regressor* (sigmoid or isotonic, see below) to fit a calibrator.
- This returns an ensemble of $k$ `(classifier, calibrator)` pairs. Each pair is available in `calibrated_classifiers_`.
- `predict_proba` for the main instance is the average of the predicted probabilities of the $k$ estimators in the list. `predict` returns the class with the highest probability.

`ensemble=False` tells the calibrator to find unbiased predictions for the entire dataset via `cross_val_predict` and to fit a single calibrator on them.

`calibrated_classifiers_` contains only one `(classifier, calibrator)` pair. The classifier is the `base_estimator` trained on all data. `predict_proba` is the predicted probabilities from the single pair.

`ensemble=True` should return slightly better accuracy, as it benefits from typical ensemble effects. `ensemble=False` is better when computation time is a concern.

You can use `cv="prefit"` if a prefitted classifier is already available.
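The two `ensemble` modes can be compared directly. This is a minimal sketch, assuming a scikit-learn version that has the `ensemble` parameter; the synthetic data and `cv=3` are choices made for illustration. The classifier is passed positionally because the keyword name of that argument differs between scikit-learn versions.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=600, n_features=20, n_informative=2,
                           n_redundant=2, random_state=0)

# ensemble=True: one (classifier, calibrator) pair per cross-validation fold.
ensemble_model = CalibratedClassifierCV(LinearSVC(), method="sigmoid",
                                        cv=3, ensemble=True).fit(X, y)

# ensemble=False: a single pair, with the classifier refit on all the data.
single_model = CalibratedClassifierCV(LinearSVC(), method="sigmoid",
                                      cv=3, ensemble=False).fit(X, y)

n_pairs_ensemble = len(ensemble_model.calibrated_classifiers_)
n_pairs_single = len(single_model.calibrated_classifiers_)
print(n_pairs_ensemble, n_pairs_single)

# One row per sample, one column per class.
proba = ensemble_model.predict_proba(X[:5])
print(proba)
```

`LinearSVC` has no `predict_proba` of its own, so the wrapper is what makes probability estimates available here.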

- `brier_score_loss` can be used to assess calibration performance.
  - Brier scores are a combination of *calibration* and *refinement* losses.
  - Refinement losses can change independently of calibration losses, so lower Brier scores do not necessarily indicate better calibration.
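For binary labels, the Brier score is the mean squared difference between the predicted probability and the actual outcome. A small pure-Python sketch with made-up labels and predictions (the helper name `brier_score` is hypothetical, chosen for this example):

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


y_true = [1, 0, 1, 1, 0]
confident = [0.9, 0.1, 0.8, 0.7, 0.2]   # made-up predictions
hedged = [0.5, 0.5, 0.5, 0.5, 0.5]      # always predicts 0.5

score_confident = brier_score(y_true, confident)
score_hedged = brier_score(y_true, hedged)
print(score_confident, score_hedged)
```

The confident predictions score lower (better) here, but the gap mixes calibration and refinement, which is why a lower score alone does not prove better calibration.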

- The **Sigmoid** regressor is based on this logistic model: $p(y_i = 1 | f_i) = \frac{1}{1 + \exp(A f_i + B)}$, where $y_i$ is the true label of sample $i$ and $f_i$ is the uncalibrated classifier output for that sample. $A$ and $B$ are found via maximum likelihood.
  - The sigmoid method assumes calibration curves can be corrected by applying a sigmoid function to raw predictions.
  - This method is most effective when the uncalibrated model is under-confident and has similar calibration errors for both high and low outputs.
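The maximum-likelihood fit of $A$ and $B$ can be sketched with plain gradient descent on the negative log-likelihood. This is an illustration only: the optimizer, learning rate, and synthetic data are assumptions for this example and not how scikit-learn fits its sigmoid calibrator. With $z_i = A f_i + B$, the per-sample gradient of the negative log-likelihood with respect to $z_i$ is $y_i - p_i$.

```python
import numpy as np


def fit_sigmoid(f, y, lr=0.5, n_iter=2000):
    """Fit A, B in p = 1 / (1 + exp(A * f + B)) by gradient descent
    on the mean negative log-likelihood (hypothetical helper)."""
    A, B = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * f + B))
        grad_z = y - p                    # d(NLL)/d(A*f + B), per sample
        A -= lr * np.mean(grad_z * f)
        B -= lr * np.mean(grad_z)
    return A, B


# Synthetic scores and labels generated from known parameters.
rng = np.random.default_rng(0)
f = rng.normal(size=20_000)
true_A, true_B = -2.0, 0.5
p_true = 1.0 / (1.0 + np.exp(true_A * f + true_B))
y = (rng.uniform(size=f.size) < p_true).astype(float)

A, B = fit_sigmoid(f, y)
print(A, B)
```

Note the sign convention: because the model uses $\exp(A f_i + B)$ rather than $\exp(-(A f_i + B))$, a classifier whose scores increase with the positive class yields a negative $A$.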

- The **Isotonic** regressor returns a non-decreasing function. It minimizes $\sum_{i=1}^{n} (y_i - \hat{f}_i)^2$ subject to $\hat{f}_i \geq \hat{f}_j$ whenever $f_i \geq f_j$.
  - The Isotonic method is more general-purpose than the sigmoid method: the only restriction is that the mapping function must be monotonically increasing. It is more prone to overfitting on small datasets.
  - This method can be more effective when there is enough data (>1K samples) to avoid overfitting.
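The constrained least-squares problem behind the isotonic fit can be solved with the pool-adjacent-violators approach, sketched here in pure Python. The helper name `pav` and the toy labels are hypothetical; the input is assumed to be the labels already sorted by increasing classifier score.

```python
def pav(y):
    """Least-squares non-decreasing fit to y (pool adjacent violators).

    Assumes y is ordered by increasing classifier score.
    """
    blocks = []  # each block is [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge while the previous block's mean exceeds the current one's.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


labels_sorted_by_score = [0, 1, 0, 0, 1, 1, 0, 1]
fitted = pav(labels_sorted_by_score)
print(fitted)
```

Each block of equal fitted values is a flat step of the calibration map. With few samples the steps follow the noise in individual labels, which is the overfitting risk noted above.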

- Both regressors only support 1D data (binary classification), but are extended for multiclass classification if the `base_estimator` supports it.
