When doing supervised learning, a simple sanity check consists of comparing your estimator against simple rules of thumb. DummyClassifier implements several such strategies for classification:
stratified
: generates random predictions by respecting the training set class distribution.
most_frequent
: always predicts the most frequent label in the training set.
prior
: always predicts the class that maximizes the class prior (like most_frequent); predict_proba returns the class prior.
uniform
: generates predictions uniformly at random.
constant
: always predicts a constant user-specified label.
Note that with all these strategies, the predict method completely ignores the input data.
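As a minimal illustration of this point (the toy arrays below are made up for this sketch), the predictions and predicted probabilities depend only on the training labels, never on the inputs passed to predict:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# toy training data; the feature values are arbitrary and ignored
X = np.array([[-1], [0], [1], [2]])
y = np.array([0, 0, 1, 1])

clf = DummyClassifier(strategy="prior").fit(X, y)
# any input yields the same prediction and the same class prior
print(clf.predict(np.zeros((3, 1))))        # -> [0 0 0]
print(clf.predict_proba(np.zeros((3, 1))))  # each row is the prior [0.5 0.5]
```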
# first, create an imbalanced dataset from iris
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
y[y != 1] = -1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# compare the accuracy of SVC and most_frequent
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
clf1 = SVC(kernel='linear', C=1).fit(X_train, y_train)
clf2 = DummyClassifier(strategy='most_frequent', random_state=0).fit(X_train, y_train)
print(clf1.score(X_test, y_test))
print(clf2.score(X_test, y_test))
0.631578947368421
0.5789473684210527
We see that SVC does not do much better than a dummy classifier. Now, let's change the kernel:
clf3 = SVC(kernel='rbf', C=1).fit(X_train, y_train)
print(clf3.score(X_test, y_test))
0.9473684210526315
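To see where the most_frequent baseline above comes from, one can inspect the class balance of the test split; the majority-class share should match the dummy classifier's accuracy. A quick sketch reusing the same split:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
y[y != 1] = -1  # same imbalanced relabeling as above
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# the most_frequent dummy always predicts the training majority class (-1),
# so its test accuracy equals the share of -1 labels in the test set
values, counts = np.unique(y_test, return_counts=True)
print(counts.max() / counts.sum())  # ~0.5789, matching the dummy score above
```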
DummyRegressor also implements four rules of thumb for regression:
mean
: predicts the mean of the training targets.
median
: predicts the median of the training targets.
quantile
: predicts a user-provided quantile of the training targets.
constant
: predicts a constant user-specified value.
import numpy as np
from sklearn.dummy import DummyRegressor
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 5.0, 10.0])
dummy_regr = DummyRegressor(strategy="mean").fit(X, y)
print(dummy_regr.predict(X), dummy_regr.score(X, y))
[5. 5. 5. 5.] 0.0
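The remaining strategies follow the same pattern; a short sketch (target values chosen arbitrarily) for median, quantile, and constant:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

y = np.array([2.0, 3.0, 5.0, 10.0])
X = np.zeros((4, 1))  # features are ignored; only the number of rows matters

# median of the training targets: 4.0
print(DummyRegressor(strategy="median").fit(X, y).predict(X))
# 0.75-quantile of the training targets (linear interpolation): 6.25
print(DummyRegressor(strategy="quantile", quantile=0.75).fit(X, y).predict(X))
# the user-specified constant: 4.2
print(DummyRegressor(strategy="constant", constant=4.2).fit(X, y).predict(X))
```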