### Covariance

• Covariance can be described as an estimate of the shape of a dataset's scatter plot.

### Empirical (observed) Covariance

• A dataset's covariance matrix is well approximated by the maximum likelihood estimate (MLE), provided the number of observations is sufficiently large compared to the number of features.

• Empirical covariance results depend on whether the data is centered. With assume_centered=False, the test set is assumed to have the same mean vector as the training set. If that doesn't hold, both should be centered by the user, and assume_centered=True should be used.
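As a minimal sketch (the toy data and parameters here are illustrative, not from the docs), the empirical estimate can be computed with EmpiricalCovariance:

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance

rng = np.random.RandomState(0)
# 500 samples from a known 2-feature Gaussian (toy data)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.0], [1.0, 2.0]], size=500)

# assume_centered=False (the default): the estimator subtracts the mean itself
emp = EmpiricalCovariance(assume_centered=False).fit(X)
print(emp.location_)     # estimated mean vector
print(emp.covariance_)   # estimated 2x2 covariance matrix
```

The fitted covariance matches the biased (divide-by-n) sample covariance of the centered data.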

### Shrunk Covariance

• MLE is not a good estimator of a covariance matrix's eigenvalues - so the precision matrix (obtained by inverting it) is not accurate. Sometimes an empirical covariance matrix cannot even be inverted, for numerical reasons.

• To avoid this problem, Scikit-Learn offers a shrinkage transformation (ShrunkCovariance) with a user-defined shrinkage coefficient.

• Shrinkage consists of reducing the ratio between the smallest & largest eigenvalues of the empirical covariance matrix, by shifting every eigenvalue by a given offset. This is equivalent to finding an L2-penalized MLE of the covariance matrix.

• Mathematically, the shrinkage is $\Sigma_{\rm shrunk} = (1-\alpha)\hat{\Sigma} + \alpha\frac{{\rm Tr}\hat{\Sigma}}{p}\rm Id$ where $\alpha$ defines a bias/variance tradeoff.
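The formula above can be checked numerically with ShrunkCovariance, whose shrinkage parameter plays the role of $\alpha$ (toy data below):

```python
import numpy as np
from sklearn.covariance import ShrunkCovariance, empirical_covariance

rng = np.random.RandomState(0)
X = rng.randn(200, 5)  # 200 samples, 5 features

alpha = 0.2
shrunk = ShrunkCovariance(shrinkage=alpha).fit(X)

# Recompute Sigma_shrunk = (1 - alpha) * Sigma_hat + alpha * (Tr(Sigma_hat)/p) * Id
sigma_hat = empirical_covariance(X)
p = X.shape[1]
expected = (1 - alpha) * sigma_hat + alpha * np.trace(sigma_hat) / p * np.eye(p)
print(np.allclose(shrunk.covariance_, expected))  # True
```

Setting alpha=0 recovers the empirical estimate; alpha=1 gives a scaled identity matrix.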

### Ledoit-Wolf (LW) shrinkage

• The LW covariance matrix estimator is based on a paper that finds an optimal shrinkage coefficient ($\alpha$) that minimizes the mean squared error (MSE) between the estimated & real covariance matrices.
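A minimal sketch with toy data: LedoitWolf computes the coefficient from the data itself, exposing it as shrinkage_.

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.RandomState(0)
X = rng.randn(100, 20)  # toy data: 100 samples, 20 features

lw = LedoitWolf().fit(X)
print(lw.shrinkage_)         # data-driven coefficient, in [0, 1]
print(lw.covariance_.shape)  # (20, 20)
```

No coefficient needs to be chosen by the user; the formula picks it automatically.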

### Oracle Approximating Shrinkage (OAS)

• If a dataset is Gaussian, OAS uses another formula to choose a shrinkage coefficient that yields a smaller MSE than LW.
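OAS has the same fit/attribute interface as LedoitWolf, so the two coefficients are easy to compare on the same (toy, Gaussian) data:

```python
import numpy as np
from sklearn.covariance import OAS, LedoitWolf

rng = np.random.RandomState(0)
# Few samples relative to features, drawn from an identity-covariance Gaussian
X = rng.randn(30, 20)

oas = OAS().fit(X)
lw = LedoitWolf().fit(X)
print(oas.shrinkage_, lw.shrinkage_)  # each formula's chosen coefficient, in [0, 1]
```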

### Example: LW vs OAS vs MLE

• Illustrates 3 methods for setting the bias-variance tradeoff used in shrunk covariance estimators: cross-validation, LW & OAS.

• Plot the likelihood of unseen data vs shrinkage.

• Note: MLE corresponds to no shrinkage and performs poorly. LW performs well; OAS lands slightly further from the optimum.

### Sparse Inverse Covariance, aka Precision Matrix

• It encodes partial independence relations: if two features are independent conditionally on the others, the corresponding precision-matrix coefficient is zero.

• This is why estimating a sparse precision matrix makes sense; the estimate is better conditioned by learning independence relations from the data. This is known as covariance selection.

### Example: sparse inverse covariance estimates

• Uses the Graphical Lasso estimator to learn a covariance and sparse precision from a small number of samples.

• With L2 shrinkage (as in LW), the number of samples is small, so heavy shrinkage is needed. The LW precision estimate is then fairly close to the ground truth precision, but the off-diagonal structure is lost.

• The L1-penalized estimator recovers some of the off-diagonal structure. It can't find the exact sparsity pattern (it finds too many non-zero coefficients). However, the highest non-zero coefficients correspond to ground truth.

• The L1 precision estimate coefficients are biased towards zero. They are all smaller than the corresponding ground truth value because of the penalty.

• The color range of the precision matrices is tweaked for readability. (the full range of empirical precision values is not shown.)

• The alpha parameter of GraphicalLasso (which controls sparsity) is set by cross-validation.
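The example's setup can be sketched as follows (dimensions and sample count here are illustrative): generate a ground-truth sparse precision matrix, sample from the corresponding Gaussian, and let GraphicalLassoCV pick alpha.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV
from sklearn.datasets import make_sparse_spd_matrix

rng = np.random.RandomState(0)
# Ground-truth sparse precision matrix and samples from its Gaussian
prec = make_sparse_spd_matrix(10, alpha=0.95, random_state=0)
cov = np.linalg.inv(prec)
X = rng.multivariate_normal(np.zeros(10), cov, size=60)

model = GraphicalLassoCV().fit(X)
print(model.alpha_)       # L1 penalty chosen by cross-validation
print(model.precision_)   # sparse precision estimate
```

model.covariance_ holds the matching covariance estimate; model.precision_ is its (sparse) inverse.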

### Outliers and Minimum Covariance Determinant (MCD) estimators

• Real-world datasets are often subject to errors and uncommon observations, aka outliers.

• Empirical and shrunk covariance estimators are very sensitive to outliers. Robust covariance estimators can perform outlier detection & discard/downweight outlying data.

• The Minimum Covariance Determinant estimator finds a proportion ($h$) of non-outlying observations and builds their empirical covariance matrix. This matrix is then rescaled (a "consistency step") so the estimate is consistent for Gaussian data.

• This estimator can be used to give weights to observations according to their Mahalanobis distance, leading to a re-weighted version of the covariance matrix.

• The FastMCD algorithm is used to build the MCD object.
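A minimal sketch of the robustness claim, on toy contaminated data (the support_fraction value here is illustrative): the MCD location stays near the true mean, while the MLE is pulled toward the outliers.

```python
import numpy as np
from sklearn.covariance import MinCovDet, EmpiricalCovariance

rng = np.random.RandomState(42)
X = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=200)
X[:20] = rng.uniform(5, 10, size=(20, 2))  # replace 10% with far-away outliers

mcd = MinCovDet(support_fraction=0.75, random_state=42).fit(X)
emp = EmpiricalCovariance().fit(X)

# MCD location stays near the true mean (0, 0); the MLE drifts toward the outliers
print(mcd.location_, emp.location_)
```

mcd.support_ is a boolean mask of the observations kept as the non-outlying subset.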

### Example: covariance estimates and Mahalanobis distances

• For Gaussian data: the distance of an observation to the mode of the distribution can be found using its Mahalanobis distance: $d_{(\mu,\Sigma)}(x_i)^2 = (x_i - \mu)^T\Sigma^{-1}(x_i - \mu)$ where $\mu$ and $\Sigma$ are the location & covariance of the underlying Gaussian.

• In practice, $\mu$ and $\Sigma$ are replaced with estimates. MLE is very sensitive to outliers - and therefore so are the downstream Mahalanobis distances.

• It's better to use MCD to guarantee a measure of resistance to outliers.

• This example shows how Mahalanobis distances are altered by outliers: when using standard covariance MLE distances, contaminated observations can't be distinguished from a real Gaussian. The differences become clear using MCD-based Mahalanobis distances.

• Show ability of MCD-based Mahalanobis distances to distinguish outliers.
• The cube root of the Mahalanobis distances yields roughly normal distributions.
• Plot inlier & outlier values with boxplots.
• Outlier distributions should be more separated from the inlier distributions.