# Linear Models

• math notation:
• $\hat{y}(w, x) = w_0 + w_1 x_1 + ... + w_p x_p$
• vector of coefficients: $w = (w_1, ..., w_p)$, exposed as the attribute coef_.
• intercept: $w_0$, exposed as the attribute intercept_.
1. Ordinary Least Squares (OLS)
2. Non-Negative Least Squares (NNLS)
3. Ridge Regression (penalized coefficient sizes)
4. Ridge Classification
5. Ridge Regression with Built-in Cross Validation
6. Lasso (sparse coefficients)
7. Model Selection Using Information Criteria
8. Multi-task Lasso
9. Elastic-Net
10. Multi-task Elastic-Net
11. Least Angle Regression (LARS)
12. LARS Lasso

### Ordinary Least Squares (OLS)

• complexity: if X is a matrix of shape (n_samples, n_features), OLS has a cost of $O(n_{\text{samples}} n_{\text{features}}^2)$, assuming $n_{\text{samples}} \geq n_{\text{features}}$.
• OLS coefficient estimates rely on the independence of the features.
• When features are correlated (the columns of the design matrix are nearly linearly dependent), OLS becomes highly sensitive to random errors in the observed targets, producing a large variance.
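
A minimal sketch of OLS with scikit-learn's LinearRegression (the toy data is illustrative):

```python
# A sketch of OLS with LinearRegression on toy data.
from sklearn.linear_model import LinearRegression

X = [[0, 0], [1, 1], [2, 2]]  # (n_samples, n_features), illustrative values
y = [0, 1, 2]

reg = LinearRegression().fit(X, y)
print(reg.coef_)       # fitted w_1 ... w_p
print(reg.intercept_)  # fitted w_0
```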

### Least Squares (Non-Negative)

• Useful when modeling physical or naturally non-negative quantities. The estimator accepts a boolean positive parameter that constrains the coefficients to be non-negative (see the sketch below).
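
A minimal sketch of NNLS via the positive flag (synthetic data; the true coefficients below are chosen non-negative for illustration):

```python
# A sketch of non-negative least squares via LinearRegression(positive=True).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)  # synthetic data for illustration
X = rng.randn(50, 3)
y = X @ np.array([1.0, 2.0, 0.0]) + 0.1 * rng.randn(50)

nnls = LinearRegression(positive=True).fit(X, y)
print(nnls.coef_)  # every coefficient is constrained to be >= 0
```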

### Ridge Regression

• Addresses OLS weaknesses by imposing a shrinkage penalty on the size of the coefficients, controlled by $\alpha \geq 0$. Larger values of $\alpha$ mean stronger shrinkage, making the coefficients more robust to collinearity.
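
In scikit-learn, Ridge minimizes $||Xw - y||_2^2 + \alpha ||w||_2^2$; a minimal sketch on toy data:

```python
# A sketch of Ridge; alpha controls the amount of shrinkage.
from sklearn.linear_model import Ridge

X = [[0, 0], [0, 0], [1, 1]]  # toy data for illustration
y = [0.0, 0.1, 1.0]

reg = Ridge(alpha=0.5).fit(X, y)
print(reg.coef_, reg.intercept_)
```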

### Ridge Classification

• Converts binary targets to {-1, +1}, then treats the problem as a regression task.
• Predicted class corresponds to the sign of the prediction.
• Multiclass classification: the problem is treated as multi-output regression, and the predicted class corresponds to the output with the highest value.
• Ridge Classification can be much faster than Logistic Regression for problems with a large number of classes - it has to compute the projection matrix $(X^T X)^{-1} X^T$ only once.
• Same cost complexity as OLS.
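
A minimal sketch with RidgeClassifier on a synthetic binary problem (the make_classification settings are illustrative):

```python
# A sketch of RidgeClassifier: regression on {-1, +1} targets, classify by sign.
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

clf = RidgeClassifier(alpha=1.0).fit(X, y)
print(clf.predict(X[:5]))
print(clf.score(X, y))  # mean accuracy on the training data
```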

### Ridge Regression with Built-in Alpha Cross Validation

• Default mode: leave-one-out (LOO) CV
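
A minimal sketch of RidgeCV; the alpha grid below is an illustrative choice, not a recommendation:

```python
# A sketch of RidgeCV, which selects alpha by leave-one-out CV by default.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.RandomState(0)  # synthetic data for illustration
X = rng.randn(30, 4)
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(30)

reg = RidgeCV(alphas=np.logspace(-6, 6, 13)).fit(X, y)
print(reg.alpha_)  # the alpha selected by LOO CV
```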

### Lasso Regression

• Useful for building models with sparse coefficients.
• Commonly cited for use cases similar to compressed sensing.
• Adds a regularization term to a linear model. The function to minimize is $\min_{w} { \frac{1}{2n_{\text{samples}}} ||X w - y||_2 ^ 2 + \alpha ||w||_1}$
• $\alpha$ is a constant; $||w||_1$ is the $\ell_1$ norm.
• scikit-learn implementation uses coordinate descent as the fitting algorithm. See Least Angle Regression for an alternative approach.
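
A minimal sketch of Lasso on toy data; note the exactly-zero entries in coef_:

```python
# A sketch of Lasso; the L1 penalty drives some coefficients exactly to zero.
from sklearn.linear_model import Lasso

X = [[0, 0], [1, 1], [2, 2]]  # toy data for illustration
y = [0, 1, 2]

reg = Lasso(alpha=0.1).fit(X, y)
print(reg.coef_)             # sparse: some entries are exactly 0
print(reg.predict([[1, 1]]))
```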

### lasso_path (API call)

• A lower-level function that computes the coefficients along the full path of possible $\alpha$ values.
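
A minimal sketch of lasso_path (the make_regression setup is an illustrative assumption):

```python
# A sketch: lasso_path returns coefficients along a grid of alpha values.
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

X, y = make_regression(n_samples=100, n_features=10, noise=1.0, random_state=0)

alphas, coefs, _ = lasso_path(X, y)
print(alphas.shape)  # (n_alphas,)
print(coefs.shape)   # (n_features, n_alphas)
```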

### Model Selection Using Information Criteria

• Alternatives: Akaike information criterion (AIC) and Bayesian information criterion (BIC)
• Computationally cheaper - the regularization path is computed only once (instead of k+1 times for k-fold cross validation)
• AIC & BIC need a good estimate of the solution's degrees of freedom.
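
A minimal sketch of criterion-based selection with LassoLarsIC (synthetic data; the choice of BIC over AIC is illustrative):

```python
# A sketch: LassoLarsIC picks alpha by minimizing AIC or BIC instead of using CV.
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

X, y = make_regression(n_samples=100, n_features=20, noise=4.0, random_state=0)

bic = LassoLarsIC(criterion="bic").fit(X, y)
print(bic.alpha_)  # the alpha that minimizes BIC along the path
```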

### Multi-task Lasso

• Estimates sparse coefficients for multiple regression problems (tasks) jointly; the selected features are constrained to be the same across all tasks.
• MT Lasso solves a linear model with a mixed $\ell_1\ell_2$-norm regularizer.
• The function to minimize is $\min_{W} { \frac{1}{2n_{\text{samples}}} ||X W - Y||_{\text{Fro}} ^ 2 + \alpha ||W||_{21}}$
• $\text{Fro}$ indicates the Frobenius norm $||A||_{\text{Fro}} = \sqrt{\sum_{ij} a_{ij}^2}$
• the mixed $\ell_1\ell_2$ norm is defined as $||A||_{21} = \sum_i \sqrt{\sum_j a_{ij}^2}$.
• MT Lasso uses coordinate descent as the fitting algorithm.
• Below: simulating sequential measurements, with each "task" being an instant in time. The relevant features vary in amplitude across time.
• MT Lasso constrains the features selected at one time point to be the same features selected at every other time point (see the sketch below).
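
A minimal sketch of that setup, assuming scikit-learn's MultiTaskLasso with synthetic data (shapes, the number of tasks, and alpha are illustrative choices):

```python
# A sketch: MultiTaskLasso takes a 2D target Y of shape (n_samples, n_tasks)
# and selects the same features for every task. Data below is synthetic.
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.RandomState(0)
X = rng.randn(50, 10)
W = np.zeros((10, 3))              # 3 tasks; only the first 2 features matter
W[:2, :] = rng.randn(2, 3)
Y = X @ W + 0.1 * rng.randn(50, 3)

mtl = MultiTaskLasso(alpha=0.1).fit(X, Y)
print(mtl.coef_.shape)             # (n_tasks, n_features)
# A feature dropped for all tasks shows up as an all-zero column of coef_.
```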

### Elastic-Net

• Another linear regression model, trained with $\ell_1$ and $\ell_2$ regularization of the coefficients.
• Enables learning sparse models (like Lasso) while providing regularization features similar to Ridge. Controlled by l1_ratio parameter.
• Useful when multiple features are correlated with one another: Lasso is likely to pick one of them at random, while Elastic-Net is likely to pick both.
• Function to minimize: $\min_{w} { \frac{1}{2n_{\text{samples}}} ||X w - y||_2 ^ 2 + \alpha \rho ||w||_1 + \frac{\alpha(1-\rho)}{2} ||w||_2 ^ 2}$
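
A minimal sketch of scikit-learn's ElasticNet, whose l1_ratio parameter plays the role of $\rho$ in the objective above (data and hyperparameters are illustrative):

```python
# A sketch of ElasticNet: l1_ratio corresponds to rho in the objective above.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=10, noise=1.0, random_state=0)

enet = ElasticNet(alpha=0.1, l1_ratio=0.7).fit(X, y)
print(enet.coef_)  # a mix of shrunken and exactly-zero coefficients
```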

### Multi-task Elastic-Net

• Estimates sparse coefficients for multiple regression problems jointly. (Constraint: the selected features are the same for all regressions, aka tasks.)
• Linear model, trained with a mixed $\ell_1\ell_2$ norm and an $\ell_2$ norm for regularization.
• Function to minimize: $\min_{W} { \frac{1}{2n_{\text{samples}}} ||X W - Y||_{\text{Fro}}^2 + \alpha \rho ||W||_{2 1} + \frac{\alpha(1-\rho)}{2} ||W||_{\text{Fro}}^2}$
• Uses coordinate descent as the fitting algorithm.
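
A minimal sketch, assuming scikit-learn's MultiTaskElasticNet with synthetic data (same 2D-target convention as MultiTaskLasso):

```python
# A sketch of MultiTaskElasticNet on synthetic data.
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet

rng = np.random.RandomState(0)
X = rng.randn(50, 8)
Y = X[:, :2] @ rng.randn(2, 3) + 0.1 * rng.randn(50, 3)  # 3 related tasks

mten = MultiTaskElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, Y)
print(mten.coef_.shape)  # (n_tasks, n_features)
```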

### Elastic-Net (Multitask w/ Cross Validation)

• Uses CV to set the alpha and l1_ratio parameters.
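
A minimal sketch with MultiTaskElasticNetCV; the l1_ratio candidates and cv=3 below are illustrative choices:

```python
# A sketch: MultiTaskElasticNetCV cross-validates alpha and l1_ratio jointly.
import numpy as np
from sklearn.linear_model import MultiTaskElasticNetCV

rng = np.random.RandomState(0)
X = rng.randn(60, 5)
Y = X[:, :2] @ rng.randn(2, 2) + 0.1 * rng.randn(60, 2)

cv = MultiTaskElasticNetCV(l1_ratio=[0.3, 0.5, 0.9], cv=3).fit(X, Y)
print(cv.alpha_, cv.l1_ratio_)  # hyperparameters selected by CV
```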

### Least Angle Regression (LARS)

• Used for high-dimensional data problems - numerically efficient approach.
• Similar to fwd-stepwise regression: it finds the feature most correlated with the target in each step. If multiple features have equal correlation, it proceeds in a direction equiangular between the features.
• Same computational complexity as OLS.
• Returns a full piecewise solution path - useful for cross-validation & tuning.
• Given that it relies on iterative refits of the residuals, LARS can be sensitive to noise.
• Low-level implementations: lars_path and lars_path_gram.
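
A minimal sketch of the Lars estimator; the cap of 3 nonzero coefficients is an arbitrary illustrative choice:

```python
# A sketch: Lars with a cap on the number of nonzero coefficients.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lars

X, y = make_regression(n_samples=100, n_features=10, noise=1.0, random_state=0)

reg = Lars(n_nonzero_coefs=3).fit(X, y)
print(reg.coef_)  # at most 3 entries are nonzero
```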

### LARS Lasso

• Unlike the coordinate-descent implementation (Lasso), this yields the exact solution, which is piecewise linear as a function of the norm of its coefficients.
• Similar to fwd-stepwise regression, but the coefficients are increased in a direction that is equiangular to each one's correlations with the residual.
• Instead of returning a vector, LARS returns a curve representing the solution for each value of the $\ell_1$ norm of the parameter vector.
• The full coefficients path is stored in coef_path_, of shape (n_features, max_features + 1). The first column is always zero.
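
A minimal sketch of LassoLars on toy data (the tiny dataset and alpha are illustrative):

```python
# A sketch of LassoLars; coef_path_ holds the full piecewise-linear path.
from sklearn.linear_model import LassoLars

X = [[-1, 1], [0, 0], [1, 1]]  # toy data for illustration
y = [-1.0, 0.0, -1.0]

reg = LassoLars(alpha=0.01).fit(X, y)
print(reg.coef_)
print(reg.coef_path_)  # columns are path steps; the first column is all zeros
```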

### Example: Computing the Lasso Path vs. Regularization with the LARS Algorithm

• Each color represents a different feature of the coefficient vector, plotted as a function of the regularization parameter.
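
A minimal sketch of such a plot, assuming lars_path with method="lasso" on the diabetes dataset (the dataset choice is illustrative):

```python
# A sketch: compute the Lasso path with LARS and plot one line per feature.
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

X, y = load_diabetes(return_X_y=True)
alphas, _, coefs = lars_path(X, y, method="lasso")

plt.plot(alphas, coefs.T)  # each colored line is one feature's coefficient
plt.xlabel("alpha")
plt.ylabel("coefficient value")
plt.title("Lasso path computed with LARS")
plt.show()
```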