### PCA (Principal Component Analysis)

• A linear dimensionality reduction technique which uses singular value decomposition (SVD) to project a dataset to a lower dimensional space.
• The fit method learns the first $n_{\mathrm{components}}$ components from the data.
• PCA centers, but does not scale, inputs before applying SVD.
• whiten=True enables projecting the data onto the singular space while scaling each component to unit variance.
• Uses LAPACK to calculate the full SVD, or the randomized truncated SVD method of Halko et al. (2009). The choice depends on the input data shape & the number of components to extract.
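
A minimal sketch of these points (assuming scikit-learn's `PCA` API; the data here is synthetic):

```python
# Fit PCA on toy data, with whitening and an explicit SVD solver choice.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 20))          # 500 samples, 20 features

pca = PCA(n_components=5, whiten=True, svd_solver="full")  # "full" uses LAPACK
X_reduced = pca.fit_transform(X)        # inputs are centered (not scaled) internally

print(X_reduced.shape)                  # (500, 5)
print(pca.explained_variance_ratio_)    # variance captured per component
```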

### Probabilistic PCA vs Factor Analysis

• Compare PCA & FA with cross-validation on low-rank data that is corrupted with homoscedastic noise (noise variance is the same for each feature) or heteroscedastic noise (noise variance is different for each feature).
• 2nd step: compare the model likelihood to the likelihoods obtained from shrinkage covariance estimators.
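
A minimal sketch of the comparison (assumptions: synthetic low-rank data plus homoscedastic noise; a full comparison would also sweep heteroscedastic noise and the number of components):

```python
import numpy as np
from sklearn.covariance import LedoitWolf
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_samples, n_features, rank = 500, 25, 5
W = rng.normal(size=(rank, n_features))
X = rng.normal(size=(n_samples, rank)) @ W + rng.normal(size=(n_samples, n_features))

for model in (PCA(n_components=rank), FactorAnalysis(n_components=rank)):
    # score() returns the average log-likelihood, so higher is better
    ll = cross_val_score(model, X).mean()
    print(type(model).__name__, ll)

# Baseline: cross-validated log-likelihood under a shrinkage covariance estimator
print("LedoitWolf", cross_val_score(LedoitWolf(), X).mean())
```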

### Example: PCA vs LDA, Iris dataset

• PCA finds the combination of features containing the most variance between the samples.
• LDA finds the attributes that account for the most variance between classes. LDA is a supervised method and therefore uses the known class labels.
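
A minimal sketch comparing the two projections on the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)        # unsupervised: ignores y
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised: uses y

print(X_pca.shape, X_lda.shape)  # both (150, 2)
```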

### Incremental PCA

• Standard PCA only supports batch processing - the entire dataset must fit in main memory.
• Incremental PCA enables out-of-core partial computation whose results closely match those of standard PCA.
• partial_fit uses subsets of data fetched sequentially from disk or a network.
• The fit method can also be called on sparse matrices or on a memory-mapped file created with numpy.memmap.
• Stores estimates of the component & noise variances, and uses them to update explained_variance_ratio_ incrementally.
• Memory usage depends on #samples/batch.
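
A minimal sketch of out-of-core fitting with partial_fit; here the "batches" are slices of an in-memory array, but they could just as well be chunks read from disk or a network:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X = rng.normal(size=(10_000, 50))

ipca = IncrementalPCA(n_components=10)
for batch in np.array_split(X, 10):
    ipca.partial_fit(batch)            # memory use is bounded by the batch size

X_reduced = ipca.transform(X[:5])      # transform can also be applied in chunks
print(ipca.explained_variance_ratio_.sum())
```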

### PCA with Randomized SVD

• Use case: projecting data to a low-D space while preserving most of the variance. This is done by dropping the singular vectors associated with the lower singular values.

• Example: 64x64 pixel gray-level pictures (for face recognition) have dimensionality 4096 (which makes SVM training very slow). Since most human faces look somewhat alike, it makes sense to use PCA to project down to ~200 dimensions.

• In other words: if we're going to drop most of the singular vectors anyway, it's more efficient to limit the computation to an approximation of the vectors we will actually keep.

• This is accomplished by setting svd_solver="randomized".

• The memory footprint is much smaller than that of exact PCA: $2 \cdot n_{\max} \cdot n_{\mathrm{components}}$ vs $n_{\max} \cdot n_{\min}$, where $n_{\max} = \max(n_{\mathrm{samples}}, n_{\mathrm{features}})$ and $n_{\min} = \min(n_{\mathrm{samples}}, n_{\mathrm{features}})$.
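
A minimal sketch of randomized PCA on face-sized vectors (the 4096-dimensional figure from the note; the data here is random, not real faces):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(1_000, 4096))     # e.g. 64x64 gray-level images, flattened

pca = PCA(n_components=200, svd_solver="randomized", random_state=0)
X_reduced = pca.fit_transform(X)       # only ~200 approximate singular vectors are computed
print(X_reduced.shape)                 # (1000, 200)
```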

### Sparse PCA

• MiniBatchSparsePCA is a faster, but less accurate, variant of SparsePCA: faster because it iterates over small "chunks" of the features for a given number of iterations.
• Standard PCA yields dense components, i.e. linear combinations of the original features with mostly non-zero coefficients. Many problems can instead use sparse vectors as the underlying components (ex: faces can be decomposed into localized face features).
• Many implementations of this algorithm exist - scikit-learn uses Mrl09. The optimization is a dictionary learning problem with an $\ell_1$ penalty (sparsity-inducing) applied to the components: $\begin{split}(U^*, V^*) = \underset{U, V}{\operatorname{arg\,min}} \; & \frac{1}{2} ||X - UV||_2^2 + \alpha ||V||_1 \\ \text{subject to } & ||U_k||_2 = 1 \text{ for all } 0 \leq k < n_{\mathrm{components}}\end{split}$
• The $\ell_1$ norm also protects components from noise when few training samples are available. The strength of the penalty is controlled via the alpha hyperparameter (small values = gentle regularization; large values shrink many coefficients to zero).
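
A minimal sketch of sparse components (synthetic data; the alpha value is illustrative, not tuned):

```python
import numpy as np
from sklearn.decomposition import SparsePCA, MiniBatchSparsePCA

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 64))

spca = SparsePCA(n_components=5, alpha=1.0, random_state=0)
X_spca = spca.fit_transform(X)
print((spca.components_ == 0).mean())  # fraction of exactly-zero loadings

# Faster, less accurate variant that iterates over small chunks of the features
mbspca = MiniBatchSparsePCA(n_components=5, alpha=1.0, random_state=0)
X_mb = mbspca.fit_transform(X)
```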

### Example: face recognition with eigenfaces & SVMs

• Using "labeled faces in the wild" (LFW) dataset.