Many scikit-learn classifiers expose predict_proba, but not all of them do (LinearSVC, for example, only provides decision_function). A well-calibrated classifier is one whose predict_proba output can be read directly as a confidence level: among the samples given a predict_proba value of ~0.8, roughly 80% should actually belong to the positive class (~80% of labels predicted correctly). The snippet below compares how well four classifiers are calibrated on a synthetic dataset:

import numpy as np
np.random.seed(0)
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB as GNB
from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.svm import LinearSVC as LSVC
from sklearn.calibration import calibration_curve as CC
X, y = datasets.make_classification(n_samples=100000, n_features=20,
                                    n_informative=2, n_redundant=2)
train_samples = 100
X_train = X[:train_samples]; X_test = X[train_samples:]
y_train = y[:train_samples]; y_test = y[train_samples:]
plt.figure(figsize=(10, 10))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))
ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
for clf, name in [(LR(), 'Logistic'),
                  (GNB(), 'Naive Bayes'),
                  (LSVC(C=1.0), 'Support Vector Classification'),
                  (RFC(), 'Random Forest')]:
    clf.fit(X_train, y_train)
    if hasattr(clf, "predict_proba"):
        prob_pos = clf.predict_proba(X_test)[:, 1]
    else:  # no predict_proba: rescale the decision function to [0, 1]
        prob_pos = clf.decision_function(X_test)
        prob_pos = (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
    fraction_of_positives, mean_predicted_value = CC(y_test, prob_pos, n_bins=10)
    ax1.plot(mean_predicted_value, fraction_of_positives, "s-", label=name)
    ax2.hist(prob_pos, range=(0, 1), bins=10, label=name,
             histtype="step", lw=2)
ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('Calibration plots (reliability curve)')
ax2.set_xlabel("Mean predicted value")
ax2.set_ylabel("Count")
ax2.legend(loc="upper center", ncol=2)
plt.tight_layout()
plt.show()
Calibrating a classifier means fitting a regressor (the calibrator) that maps the classifier's output (from decision_function or predict_proba) to a calibrated probability in [0, 1]. CalibratedClassifierCV uses cross-validation to ensure the calibrator is always fitted with unbiased data, i.e. on predictions for samples the classifier was not trained on: it splits the data into $k$ pairs of (train_set, test_set).
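As a rough sketch of that workflow (not part of the original example; it reuses X_train, y_train and X_test from the snippet above, and passes the wrapped estimator positionally because its parameter name differs across scikit-learn versions):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# LinearSVC exposes only decision_function; the wrapper adds a calibrated predict_proba.
base_clf = LinearSVC(C=1.0)
calibrated_clf = CalibratedClassifierCV(base_clf, method="sigmoid", cv=3)
calibrated_clf.fit(X_train, y_train)           # builds 3 (train_set, test_set) pairs internally
probas = calibrated_clf.predict_proba(X_test)  # calibrated probabilities in [0, 1]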
ensemble=True (default) tells the calibrator to do the following steps independently on each train-test pair:

- Clone the base_estimator and train it on the training subset.
- Fit a calibrator on the clone's predictions for the test subset.

This yields $k$ (classifier, calibrator) pairs, exposed in the calibrated_classifiers_ attribute. predict_proba for the main instance is the average of the $k$ estimators in the list, and predict returns the class with the highest probability.

ensemble=False tells the calibrator to find unbiased predictions for the entire dataset via cross_val_predict and to fit a single calibrator on them. Here calibrated_classifiers_ contains only one (classifier, calibrator) pair - the classifier is the base_estimator trained on all data - and predict_proba is simply the predicted probabilities from that single pair.

ensemble=True should return slightly better accuracy, as it benefits from typical ensemble effects; ensemble=False is better when computation time is a concern.
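A minimal sketch of the difference (again reusing X_train and y_train from above; the particular classifier and cv value are only illustrative):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

ens = CalibratedClassifierCV(LogisticRegression(), cv=5, ensemble=True)
ens.fit(X_train, y_train)
print(len(ens.calibrated_classifiers_))      # 5 pairs; predict_proba averages their outputs

single = CalibratedClassifierCV(LogisticRegression(), cv=5, ensemble=False)
single.fit(X_train, y_train)
print(len(single.calibrated_classifiers_))   # 1 pair; its calibrator was fit on cross_val_predict outputs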
You can use cv="prefit" if a prefitted classifier is already available; in that case the data passed to fit is used only to fit the calibrator.
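A sketch of that path (the 50/50 split below is arbitrary; the point is that the calibration data must be disjoint from the data the classifier was trained on):

from sklearn.model_selection import train_test_split
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

X_fit, X_calib, y_fit, y_calib = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0)

prefit_clf = LinearSVC(C=1.0).fit(X_fit, y_fit)     # classifier trained beforehand
calibrated = CalibratedClassifierCV(prefit_clf, cv="prefit", method="sigmoid")
calibrated.fit(X_calib, y_calib)                    # only the calibrator is fitted here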
The sigmoid method (method="sigmoid") is most effective when the uncalibrated model is under-confident and has similar calibration errors for both high and low outputs.
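For reference (the formula itself is not given above), the sigmoid calibrator is Platt scaling: it fits $p(y_i = 1 \mid f_i) = \frac{1}{1 + \exp(A f_i + B)}$, where $f_i$ is the uncalibrated score for sample $i$ and the real-valued parameters $A$ and $B$ are learned on the calibration data.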
The isotonic method (method="isotonic") fits a regressor whose output is a non-decreasing function of the classifier's score. It minimizes $\sum_{i=1}^{n} (y_i - \hat{f}_i)^2$, where $y_i$ is the true label of sample $i$ and $\hat{f}_i$ is the calibrated probability it outputs for sample $i$.
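To make the non-decreasing constraint concrete, here is a toy sketch with IsotonicRegression, the regressor behind method="isotonic" (the scores and labels are made up):

import numpy as np
from sklearn.isotonic import IsotonicRegression

scores = np.array([0.10, 0.40, 0.35, 0.80, 0.90])  # uncalibrated scores (toy values)
labels = np.array([0, 0, 1, 1, 1])                 # true labels y_i

iso = IsotonicRegression(out_of_bounds="clip")
calibrated = iso.fit_transform(scores, labels)     # fitted values, non-decreasing in the score
print(calibrated)  # [0.  0.5 0.5 1.  1. ] - the out-of-order middle pair is pooled to 0.5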
base_estimator supports it.