### Cross Decomposition / Partial Least Squares (PLS)

• A set of supervised estimators for dimensionality reduction & regression, all belonging to the Partial Least Squares (PLS) family of algorithms.
• Cross decomposition algorithms find relations between two matrices (X, Y): specifically, the directions in X that explain the maximum variance in Y. In other words, they project both X & Y into a lower-dimensional subspace such that the covariance between transformed(X) and transformed(Y) is maximal.

• Similar to Principal Component Regression (PCR), except that PCR's dimensionality reduction is unsupervised, so important predictive information can be lost in the process. PLS also reduces dimensionality, but takes the y targets into account.

• PLS estimators are well suited when the predictor matrix has #variables > #observations, and when multicollinearity is present. (Standard linear regression would fail in this situation unless regularized.)
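• A minimal sketch (synthetic data, illustrative shapes) fitting scikit-learn's `PLSRegression` on a matrix with far more variables than observations, a regime where plain least squares is ill-posed:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))   # 100 variables, only 20 observations
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=20)

# Plain least squares is ill-posed here; PLS regularizes by
# regressing on a few latent components instead of all 100 columns.
pls = PLSRegression(n_components=3).fit(X, y)
print(pls.score(X, y))           # in-sample R^2
```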

### Canonical PLS

• Given two centered matrices $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{n \times t}$, and #components $K$:
• Set $X_1$ to $X$ and $Y_1$ to $Y$. For each $k \in [1, K]$:
• Find the 1st left ($u_k$) & right ($v_k$) singular vectors of the cross-covariance matrix $C = X_k^T Y_k$. $u_k$ and $v_k$ are the weights; they are chosen to maximize the covariance between the projected $X_k$ and the projected target: $\text{Cov}(X_k u_k, Y_k v_k)$.
• Project $X_k$ & $Y_k$ on the singular vectors to obtain scores: $\xi_k = X_k u_k$ & $\omega_k = Y_k v_k$.
• Regress $X_k$ on $\xi_k$, i.e. find a loading vector $\gamma_k$ such that the rank-1 matrix $\xi_k \gamma_k^T$ is as close as possible to $X_k$; do the same on $Y_k$ with $\omega_k$ to obtain $\delta_k$. Then deflate: $X_{k+1} = X_k - \xi_k \gamma_k^T$ and $Y_{k+1} = Y_k - \omega_k \delta_k^T$.
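• A minimal NumPy sketch of this loop, assuming pre-centered inputs and using a full SVD per iteration (simplest to read; not how scikit-learn implements it):

```python
import numpy as np

def canonical_pls_scores(X, Y, n_components):
    """Return the X and Y score matrices (xi_k and omega_k stacked as columns)."""
    Xk, Yk = X.copy(), Y.copy()                 # assumed already centered
    xis, omegas = [], []
    for _ in range(n_components):
        C = Xk.T @ Yk                           # cross-covariance matrix
        U, _, Vt = np.linalg.svd(C)
        u, v = U[:, 0], Vt[0, :]                # weights: 1st singular vector pair
        xi, omega = Xk @ u, Yk @ v              # scores
        gamma = Xk.T @ xi / (xi @ xi)           # X loadings: regress X_k on xi_k
        delta = Yk.T @ omega / (omega @ omega)  # Y loadings: regress Y_k on omega_k
        Xk = Xk - np.outer(xi, gamma)           # deflation
        Yk = Yk - np.outer(omega, delta)
        xis.append(xi)
        omegas.append(omega)
    return np.column_stack(xis), np.column_stack(omegas)
```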

### SVD PLS

• A simplified version of Canonical PLS: instead of iteratively deflating $X_k$ and $Y_k$, PLSSVD computes the SVD of $C = X^T Y$ only once & stores the n_components singular vectors corresponding to the largest singular values (as columns of $U$ and $V$).

• The transform is $\text{transformed}(X) = XU$ and $\text{transformed}(Y) = YV$.

• If n_components == 1, PLSSVD & Canonical PLS produce the same results.
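• A quick sketch checking this equivalence on random data (the exact-SVD variant of PLSCanonical is used, and scores are compared up to a possible sign flip from SVD conventions):

```python
import numpy as np
from sklearn.cross_decomposition import PLSCanonical, PLSSVD

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Y = rng.normal(size=(50, 3))

svd_x, svd_y = PLSSVD(n_components=1).fit_transform(X, Y)
can_x, can_y = PLSCanonical(n_components=1, algorithm="svd").fit_transform(X, Y)

# Compare scores up to sign; expect True True.
print(np.allclose(np.abs(svd_x), np.abs(can_x)),
      np.allclose(np.abs(svd_y), np.abs(can_y)))
```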

### PLS Regression

• Similar to Canonical PLS with algorithm='nipals', with 2 differences:

• $v_k$ is not normalized during the $u_k$ & $v_k$ computation step.
• The $Y_k$ targets are approximated with the projection of $X_k$ (i.e., $\xi_k$) instead of the projection of $Y_k$ (i.e., $\omega_k$). In other words, the loadings computation is different.
• Because of this, the `predict` and `transform` methods will be different.
• PLS Regression is also known as PLS1 (single targets) and PLS2 (multiple targets). It is a form of regularized linear regression where the #components controls regularization strength.
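• A minimal usage sketch (synthetic data) showing fit / predict / transform with two targets (the PLS2 case; a 1-D y would be PLS1):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = np.column_stack([X[:, 0] + rng.normal(scale=0.1, size=200),
                     X[:, 1] - X[:, 2] + rng.normal(scale=0.1, size=200)])

pls = PLSRegression(n_components=2).fit(X, Y)
print(pls.predict(X[:3]))    # predictions via the latent components
print(pls.transform(X[:3]))  # x-scores of the first 3 samples
```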

### Canonical Correlation Analysis (CCA)

• A special case of PLS (developed prior to, and independently of, PLS; it corresponds to PLS in "mode B" in the literature).
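• A minimal usage sketch of scikit-learn's CCA estimator on synthetic correlated data:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Y = np.column_stack([X[:, 0] + rng.normal(scale=0.5, size=100),
                     X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=100)])

X_c, Y_c = CCA(n_components=2).fit_transform(X, Y)
# Paired canonical scores should be strongly correlated on data like this.
print(np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])
```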

### Example: Principal Component Regression (PCR) vs Partial Least Squares Regression (PLSR)

• Goal: show how PLS can outperform PCR when a target is correlated with directions that have a low variance.

• PCR has two steps: 1) apply PCA to the training data (possibly including dimensionality reduction); 2) train a linear regression on the transformed data. The PCA step is unsupervised, so PCR may perform poorly when the target is correlated with low-variance directions.

• PLS does both transformation & regression. It is similar to PCR, except that the transform is supervised.

• Define y such that it is correlated with a low-variance direction: project X onto its 2nd principal component and add some noise.
• Create two regressors, PCR & PLS.
• Set #components=1 for illustration.
• Standardize the data (a best practice) before feeding it into the PCA step of PCR.
• Plot the data projected onto the 1st component vs the target. Both regressors will use the projected data for training.
• Note: the unsupervised PCA transform of PCR has dropped the 2nd component (the one with the least variance), despite it being the most predictive direction.
• Print the R-squared scores of both estimators, which should confirm that PLS is the better alternative.
• PCR with 2 components should perform as well as PLS (with both components kept, PCR can also leverage the 2nd component, the one with the most predictive power).
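• A condensed sketch of this experiment, loosely following the scikit-learn gallery example (the covariance and noise values are illustrative):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# X has one high-variance and one low-variance principal direction.
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 3], [3, 4]], size=500)
pca = PCA(n_components=2).fit(X)
# y is correlated with the 2nd (low-variance) component, plus noise.
y = X @ pca.components_[1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pcr = make_pipeline(StandardScaler(), PCA(n_components=1), LinearRegression())
pcr.fit(X_train, y_train)
pls = PLSRegression(n_components=1).fit(X_train, y_train)

print(f"PCR R^2: {pcr.score(X_test, y_test):.3f}")  # low: PCA dropped the predictive direction
print(f"PLS R^2: {pls.score(X_test, y_test):.3f}")  # higher: the supervised projection keeps it
```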