### Novelty/Outlier Detection

• Many applications (dataset cleaning being a key example) require deciding whether a new sample belongs to a given distribution (an inlier) or not (an outlier).

• outlier detection: the training data contains outliers (samples that are far from the others). Outlier detectors fit the regions where the training data is most concentrated, ignoring the deviant observations.

• novelty detection: the training data is not contaminated by outliers - we want to detect whether a new observation is an outlier (a novelty).

• Both are used for anomaly detection. The estimators assume that outliers/anomalies cannot form a dense cluster, i.e. that they lie in low-density regions.

• Learning method: estimator.fit(X_train)

• Testing new observations: estimator.predict(X_test) - Inliers are labeled 1; outliers are labeled -1.

• predict applies a threshold to a scoring function, which is exposed via the score_samples method.

• The decision_function method is defined from the scoring function such that negative values are outliers and non-negative values are inliers: estimator.decision_function(X_test)
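The fit / predict / decision_function workflow above can be sketched with IsolationForest on synthetic data (the data and parameters here are illustrative assumptions, not from the original):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(100, 2)                          # inlier cluster near the origin
X_test = np.r_[0.3 * rng.randn(10, 2),                     # new regular observations
               rng.uniform(low=-4, high=4, size=(5, 2))]   # likely outliers

est = IsolationForest(random_state=42).fit(X_train)        # learn from X_train

labels = est.predict(X_test)             # 1 = inlier, -1 = outlier
scores = est.decision_function(X_test)   # negative -> outlier, non-negative -> inlier
raw = est.score_samples(X_test)          # underlying scoring function
```

The sign of decision_function agrees with the predicted labels: samples with a negative score are exactly those labeled -1.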

• LOF does not support predict, decision_function or score_samples by default - only a fit_predict method. (This estimator was originally meant for outlier detection.)

• The abnormality scores of the training samples are accessible via negative_outlier_factor_.

• If you really want to use LocalOutlierFactor for novelty detection (predicting labels or computing abnormality scores of new data), you can instantiate the estimator with the novelty parameter set to True before fitting. In this case, fit_predict is not available.

### Outlier Detection: Method Comparison

• Each dataset has 15% of samples generated with random uniform noise.

• Decision boundaries between inliers and outliers are displayed in black, except for Local Outlier Factor (LOF), which has no predict method applicable to new data.

• OneClassSVM is sensitive to outliers and therefore does not perform well for outlier detection. It is best suited for novelty detection, when the training set is not contaminated by outliers.

• That said, outlier detection in high-dimension, or without any assumptions on the distribution of the inlying data is very challenging. One-class SVM might give useful results in these situations.

• EllipticEnvelope assumes the data is Gaussian and learns an ellipse. It is robust to outliers, but degrades when the data is not unimodal.

• IsolationForest & LocalOutlierFactor perform reasonably well for multi-modal datasets. The advantage of LocalOutlierFactor is shown in the third dataset, where the two modes have different densities.

• This is explained by the local aspect of LOF - it only compares the abnormality score of one sample with the scores of its neighbors.

• The last dataset is uniformly distributed in a hypercube. Except for the OneClassSVM which overfits a little, all estimators present decent solutions. Look closely at the abnormality scores - a good estimator should assign similar scores to all the samples.

• Note: the model parameters are handpicked - in practice they need to be adjusted. In the absence of labelled data, the problem is completely unsupervised so model selection can be a challenge.
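A sketch of such a comparison on a bimodal dataset with ~15% uniform noise (the dataset and all parameter values are illustrative assumptions; as noted above, in practice they need tuning):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
# Two inlier modes with different densities, plus uniform noise.
inliers = np.r_[0.3 * rng.randn(100, 2) + [2, 2],
                rng.randn(100, 2) - [2, 2]]
X = np.r_[inliers, rng.uniform(-6, 6, size=(35, 2))]

estimators = {
    "OneClassSVM": OneClassSVM(nu=0.15, gamma=0.35),
    "EllipticEnvelope": EllipticEnvelope(contamination=0.15, random_state=42),
    "IsolationForest": IsolationForest(contamination=0.15, random_state=42),
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=35, contamination=0.15),
}
# All four estimators expose fit_predict (1 = inlier, -1 = outlier).
counts = {name: int((est.fit_predict(X) == -1).sum())
          for name, est in estimators.items()}
print(counts)
```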

### Novelty Detection - One-Class SVM

• Consider a dataset from a single distribution. Are new observations so different that we can doubt they come from the same distribution?

• One-Class SVM requires the choice of a kernel (typically, RBF) and a scalar parameter to define a frontier.

• nu (aka the margin) corresponds to the probability of finding a new, but regular, observation outside the frontier.
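The role of nu can be checked empirically: with nu=0.1, roughly 10% of new regular observations should fall outside the learned frontier. A sketch under assumed synthetic data:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(500, 2)      # clean training set (no outliers)
X_regular = 0.3 * rng.randn(200, 2)    # new regular observations, same distribution

# nu ~ probability of a new regular observation falling outside the frontier.
clf = OneClassSVM(kernel="rbf", nu=0.1, gamma=0.1).fit(X_train)
frac_errors = float((clf.predict(X_regular) == -1).mean())
print(f"regular observations outside the frontier: {frac_errors:.2f}")
```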

### Example: Species geographic distribution modeling

• Two South American mammal species, with past observation records and 14 environmental variables.

• Only positive samples are available (no records of unsuccessful observations), so treat the problem as density estimation and use OneClassSVM.

• Use basemap to plot geography outlines.
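The positive-samples-only setup can be sketched as follows. The data here is a synthetic stand-in for the 14 environmental variables at observed sites, not the real species dataset, and the parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
# Hypothetical stand-in for 14 environmental variables at presence sites.
X_presence = 0.5 * rng.randn(120, 14) + 1.0

# Standardize the features, then fit a one-class model on positives only.
X_scaled = StandardScaler().fit_transform(X_presence)
clf = OneClassSVM(kernel="rbf", nu=0.1).fit(X_scaled)

# Higher decision_function values indicate more habitat-like conditions.
scores = clf.decision_function(X_scaled)
```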