### Novelty/Outlier Detection

• Many applications (dataset cleaning being a key example) require deciding whether a new sample belongs to a given distribution (an inlier) or not (an outlier).

• outlier detection: the training data contains outliers (samples that are far from the others). Outlier detectors fit the regions where the training data is most concentrated, ignoring the deviant observations.

• novelty detection: the training data is not contaminated by outliers - we want to detect whether a new observation is an outlier (a novelty).

• Both are used for anomaly detection. The estimators assume that outliers/anomalies cannot form a dense cluster, i.e. that they lie in low-density regions.

• Learning method: estimator.fit(X_train)

• Testing new observations: estimator.predict(X_test) - Inliers are labeled 1; outliers are labeled -1.

• predict applies a threshold to a scoring function, which is exposed via the score_samples method.

• The decision_function method is defined from the scoring function such that negative values are outliers and non-negative values are inliers: estimator.decision_function(X_test)
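The fit / predict / decision_function workflow above can be sketched with IsolationForest on synthetic data (the data and parameters here are illustrative assumptions, not from the original):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(100, 2)                          # inlier cluster near the origin
X_test = np.r_[0.3 * rng.randn(10, 2),                     # new regular observations
               rng.uniform(low=-4, high=4, size=(5, 2))]   # likely outliers

est = IsolationForest(random_state=42).fit(X_train)        # learn from X_train

labels = est.predict(X_test)             # 1 = inlier, -1 = outlier
scores = est.decision_function(X_test)   # negative -> outlier, non-negative -> inlier
raw = est.score_samples(X_test)          # underlying scoring function
```

The sign of decision_function agrees with the predicted labels: samples with a negative score are exactly those labeled -1.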

• LOF does not support predict, decision_function or score_samples by default - only a fit_predict method. (This estimator was originally meant for outlier detection.)

• The abnormality scores of the training samples are accessible via negative_outlier_factor_.

• If you really want to use LocalOutlierFactor for novelty detection (predicting labels or computing abnormality scores of new data), you can instantiate the estimator with the novelty parameter set to True before fitting. In this case, fit_predict is not available.

### Outlier Detection: Method Comparison

• Each dataset has 15% of samples generated with random uniform noise.

• Decision boundaries between inliers and outliers are displayed in black, except for Local Outlier Factor (LOF), which has no predict method applicable to new data.

• OneClassSVM is sensitive to outliers and therefore does not perform well for outlier detection. It is best suited for novelty detection, when the training set is not contaminated by outliers.

• That said, outlier detection in high-dimension, or without any assumptions on the distribution of the inlying data is very challenging. One-class SVM might give useful results in these situations.

• EllipticEnvelope assumes the data is Gaussian and learns an ellipse. It is robust to outliers, but degrades when the data is not unimodal.

• IsolationForest & LocalOutlierFactor perform reasonably well for multi-modal datasets. The advantage of LocalOutlierFactor is shown in the third dataset, where the two modes have different densities.

• This is explained by the local aspect of LOF - it only compares the abnormality score of one sample with the scores of its neighbors.

• The last dataset is uniformly distributed in a hypercube. Except for the OneClassSVM which overfits a little, all estimators present decent solutions. Look closely at the abnormality scores - a good estimator should assign similar scores to all the samples.

• Note: the model parameters are handpicked - in practice they need to be adjusted. In the absence of labelled data, the problem is completely unsupervised so model selection can be a challenge.
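A sketch of such a comparison on a bimodal dataset with ~15% uniform noise (the dataset and all parameter values are illustrative assumptions; as noted above, in practice they need tuning):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
# Two inlier modes with different densities, plus uniform noise.
inliers = np.r_[0.3 * rng.randn(100, 2) + [2, 2],
                rng.randn(100, 2) - [2, 2]]
X = np.r_[inliers, rng.uniform(-6, 6, size=(35, 2))]

estimators = {
    "OneClassSVM": OneClassSVM(nu=0.15, gamma=0.35),
    "EllipticEnvelope": EllipticEnvelope(contamination=0.15, random_state=42),
    "IsolationForest": IsolationForest(contamination=0.15, random_state=42),
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=35, contamination=0.15),
}
# All four estimators expose fit_predict (1 = inlier, -1 = outlier).
counts = {name: int((est.fit_predict(X) == -1).sum())
          for name, est in estimators.items()}
print(counts)
```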

### Novelty Detection - One-Class SVM

• Consider a dataset from a single distribution. Are new observations so different that we can doubt they come from the same distribution?

• One-Class SVM requires the choice of a kernel (typically, RBF) and a scalar parameter to define a frontier.

• nu (aka the margin) corresponds to the probability of finding a new, but regular, observation outside the frontier.
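The role of nu can be checked empirically: with nu=0.1, roughly 10% of new regular observations should fall outside the learned frontier. A sketch under assumed synthetic data:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(500, 2)      # clean training set (no outliers)
X_regular = 0.3 * rng.randn(200, 2)    # new regular observations, same distribution

# nu ~ probability of a new regular observation falling outside the frontier.
clf = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.1).fit(X_train)
frac_errors = float((clf.predict(X_regular) == -1).mean())
print(f"regular observations outside the frontier: {frac_errors:.2f}")
```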

### Example: Species geographic distribution modeling

• Two South American mammal species, with past observation records and 14 environmental variables.

• Only positive samples are available (no records of unsuccessful observations), so treat the problem as density estimation and use OneClassSVM.

• Use basemap to plot geography outlines.
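The positive-samples-only setup can be sketched as follows. The data here is a synthetic stand-in for the 14 environmental variables at observed sites, not the real species dataset, and the parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# Hypothetical stand-in for 14 environmental variables at presence sites.
X_presence = 0.5 * rng.randn(120, 14) + 1.0

# Standardize the features, then fit a one-class model on positives only.
X_scaled = StandardScaler().fit_transform(X_presence)
clf = OneClassSVM(kernel="rbf", nu=0.1).fit(X_scaled)

# Higher decision_function values indicate more habitat-like conditions.
scores = clf.decision_function(X_scaled)
```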