### Standard Scaling

• Many scikit-learn estimators behave badly if the features do not look more or less like standard normally distributed data (Gaussian with zero mean and unit variance).

• We often ignore the shape of the distribution and simply center the data by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.

• StandardScaler is a quick and easy way to standardize an array-like dataset.

from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

scaler = preprocessing.StandardScaler().fit(X_train)
print(scaler, "\n", scaler.mean_, "\n", scaler.scale_)

X_scaled = scaler.transform(X_train)
print(X_scaled)

• scaled data has zero mean & unit variance.
• This class implements the Transformer API to compute the mean and standard deviation on a training set, then re-apply the same transformation on the testing set. This class is hence suitable for use in the early steps of a Pipeline.
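• Continuing the snippet above, a minimal sketch of re-applying the fitted scaler to held-out data and of dropping it into a Pipeline (the test values are made up for illustration):

```python
# Re-apply the mean/std learned on X_train to unseen data
X_test = np.array([[-1., 1., 0.]])
print(scaler.transform(X_test))

# Inside a Pipeline the scaler is fit on the training data only and then
# re-applied to any data passed to predict/score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(preprocessing.StandardScaler(), LogisticRegression())
```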

### Min-Max Scaling and Max Abs Scaling

• Another method is to scale features to lie between a given min & max value (often 0-1), or so that the maximum absolute value of each feature is scaled to unit size.

• This provides robustness to very small standard deviations of features and preserves zero entries in sparse data.

• The transformer instance can then be applied to new test data unseen during the fit: the same scaling and shifting operations will be applied to be consistent with the transformation performed on the training data.
• The fitted scaler exposes its parameters as attributes (e.g. scale_ and min_ for MinMaxScaler); see the MinMaxScaler sketch after the MaxAbsScaler example below.
• MaxAbsScaler scales such that training data lies within [-1, +1] by dividing through the largest maximum absolute value in each feature. It is meant for data already centered at zero, or sparse data.

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
print(X_train_maxabs)

X_test = np.array([[ -3., -1., 4.]])
X_test_maxabs = max_abs_scaler.transform(X_test)
print(X_test_maxabs)

print(max_abs_scaler.scale_)
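• For comparison, a parallel sketch with MinMaxScaler on the same toy array (feature_range is shown explicitly, although (0, 1) is the default):

```python
# Scale each feature of X_train into the [0, 1] range
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
X_train_minmax = min_max_scaler.fit_transform(X_train)
print(X_train_minmax)

# The same shift and scale learned during fit are applied to new data
X_test_minmax = min_max_scaler.transform(np.array([[-3., -1., 4.]]))
print(X_test_minmax)

# Viewing the fitted scaler attributes
print(min_max_scaler.scale_)
print(min_max_scaler.min_)
```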

### Scaling sparse data

• Centering sparse data destroys the sparseness - it is rarely a sensible thing to do. However, it can make sense to scale sparse inputs, especially if features are on different scales.

• MaxAbsScaler was designed for scaling sparse data. StandardScaler can also accept scipy.sparse matrix inputs (as long as with_mean=False is used). Otherwise a ValueError will be raised - silently centering would break the sparsity and can crash execution by allocating too much memory unintentionally. RobustScaler cannot be fitted to sparse inputs, but you can use the transform method on sparse inputs.

• Scalers accept both CSR and CSC format (see scipy.sparse.csr_matrix and scipy.sparse.csc_matrix). Any other sparse input will be converted to the CSR format. Choose the CSR or CSC representation upstream to avoid unnecessary memory copies.

• If the centered data is expected to be small enough, explicitly converting the input to a dense array using the toarray method of sparse matrices is another option.
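• A minimal sketch of scaling sparse input with MaxAbsScaler and with StandardScaler(with_mean=False); the toy matrix is made up for illustration:

```python
import numpy as np
from scipy import sparse
from sklearn import preprocessing

# A small CSR matrix; zero entries are preserved by both scalers below
X_sparse = sparse.csr_matrix(np.array([[1., 0., 2.],
                                       [0., 3., 0.],
                                       [4., 0., 0.]]))

# MaxAbsScaler divides each column by its maximum absolute value
print(preprocessing.MaxAbsScaler().fit_transform(X_sparse).toarray())

# StandardScaler accepts sparse input only when centering is disabled
scaler = preprocessing.StandardScaler(with_mean=False).fit(X_sparse)
print(scaler.transform(X_sparse).toarray())
```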

### Scaling data with outliers using RobustScaler

• If your data contains many outliers, mean/variance scaling probably will not work very well - use RobustScaler as a drop-in replacement. It uses more robust estimates for the center and range of your data.
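• A quick sketch on made-up values - RobustScaler centers on the median and scales by the interquartile range, so a single outlier barely shifts the result:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# The last sample is an outlier; it dominates the mean/std used by
# StandardScaler but has little effect on the median/IQR used by RobustScaler
X = np.array([[1.], [2.], [3.], [4.], [100.]])
print(StandardScaler().fit_transform(X))
print(RobustScaler().fit_transform(X))
```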

### Centering kernel matrices with KernelCenterer

• If you have a kernel matrix of a kernel $K$ that computes a dot product in a feature space defined by a function $\phi$, KernelCenterer can transform the kernel matrix so that it contains the inner products in the feature space defined by $\phi$, followed by removal of the mean in that space.
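• A rough sketch using a linear kernel on made-up data - centering the kernel matrix is equivalent to centering the (here explicit) features before taking dot products:

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import KernelCenterer

X = np.array([[1., 2.], [2., 4.], [3., 1.]])

# Kernel matrix of dot products
K = linear_kernel(X)
K_centered = KernelCenterer().fit_transform(K)

# Same result as centering X first and recomputing the dot products
X_centered = X - X.mean(axis=0)
print(np.allclose(K_centered, linear_kernel(X_centered)))  # True
```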

### Quantile Transforms

• Puts all features into the same distribution based on $G^{-1}(F(X))$, where $F$ is the cumulative distribution function of the feature and $G^{-1}$ is the quantile function of the desired output distribution $G$.

• It computes a rank transformation, which smooths out unusual distributions & is more robust to outliers.

• It does, however, distort correlations & distances within/across features.
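• A minimal sketch on made-up lognormal data - after the transform, the empirical percentiles are roughly uniform on [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))  # heavily skewed feature

qt = QuantileTransformer(n_quantiles=100, output_distribution='uniform')
X_q = qt.fit_transform(X)

# Percentiles of the transformed feature are close to 0.25 / 0.5 / 0.75
print(np.percentile(X_q, [25, 50, 75], axis=0).ravel())
```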

### Power Mapping to a Gaussian Distribution

• Power transforms are a family of parametric, monotonic transformations that map data from any distribution to an approximated Gaussian to stabilize variance and minimize skewness.

• Two transforms are available:

• Yeo-Johnson: $\begin{split}x_i^{(\lambda)} = \begin{cases} [(x_i + 1)^\lambda - 1] / \lambda & \text{if } \lambda \neq 0, x_i \geq 0, \\[8pt] \ln{(x_i + 1)} & \text{if } \lambda = 0, x_i \geq 0, \\[8pt] -[(-x_i + 1)^{2 - \lambda} - 1] / (2 - \lambda) & \text{if } \lambda \neq 2, x_i < 0, \\[8pt] -\ln (- x_i + 1) & \text{if } \lambda = 2, x_i < 0 \end{cases}\end{split}$
• Box-Cox: $\begin{split}x_i^{(\lambda)} = \begin{cases} \dfrac{x_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\[8pt] \ln{(x_i)} & \text{if } \lambda = 0, \end{cases}\end{split}$

• Box-Cox can only be applied to strictly positive data.

• Both transforms are controlled via $\lambda$, which is found via maximum likelihood estimation.
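• A short sketch on made-up lognormal data (the fitted lambdas_ depend on the data):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 2))  # strictly positive, skewed features

pt = PowerTransformer(method='box-cox', standardize=True)
X_bc = pt.fit_transform(X)
print(pt.lambdas_)  # lambda found by maximum likelihood, one per feature

# Yeo-Johnson also handles zero and negative values
pt_yj = PowerTransformer(method='yeo-johnson')
X_yj = pt_yj.fit_transform(X - 1.0)
```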

### Example: Map data to Normal Distributions (Box-Cox, Yeo-Johnson)

• Power transforms are useful when homoscedasticity & normality are needed.

• Below: Box-Cox & Yeo-Johnson transforms applied to lognormal, chi-squared, Weibull, Gaussian, uniform, and bimodal distributions.

• Success depends on the dataset, which highlights the importance of before/after visualization.

• QuantileTransformer forces any arbitrary distribution into a Gaussian, provided enough samples are available. It is prone to overfitting on small datasets - consider using a power transform instead.

• QuantileTransformer can also map data to a normal distribution with output_distribution='normal'.
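• A tiny sketch of the 'normal' output option on made-up lognormal data:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(1000, 1))

qt_normal = QuantileTransformer(n_quantiles=100, output_distribution='normal')
X_gauss = qt_normal.fit_transform(X)
print(X_gauss.mean(), X_gauss.std())  # roughly 0 and 1
```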

### Normalization

• Defined as the process of scaling individual samples to have unit norm. This is useful when you use a quadratic form (e.g. the dot-product or any other kernel) to quantify the similarity of a pair of samples.

• normalize transforms an array using 'l1', 'l2', or 'max' norms.

• Normalizer does the same using the Transformer API - it is therefore useful in pipelines.

• Both accept dense & sparse matrix inputs.
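• A minimal sketch of both the function and the Transformer-style class on a made-up array:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

# Function form: each row is rescaled to unit l2 norm
print(normalize(X, norm='l2'))

# Transformer form: stateless, so fit() just returns self; handy in pipelines
normalizer = Normalizer(norm='l2').fit(X)
print(normalizer.transform(X))
```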

### Categories to Integers

• Use OrdinalEncoder to transform category names to integers 0 through n_categories - 1.
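• A short sketch on a toy categorical dataset:

```python
from sklearn.preprocessing import OrdinalEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

enc = OrdinalEncoder().fit(X)
print(enc.categories_)   # categories inferred per feature (sorted)
print(enc.transform([['female', 'from US', 'uses Safari']]))
```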

### Categories to one-of-K ("One Hot")

• Use OneHotEncoder to transform category names into n_categories binary features, with one equal to 1 and the rest equal to 0.
• The categories for each feature are inferred automatically from the dataset.
• They can also be specified explicitly using the categories parameter.
• If the training data might be missing some categories, it's better to specify handle_unknown='ignore' instead of setting categories manually.

• In this approach, unknown categories will be coded with all zeroes.

• The drop parameter allows encoding each feature into n_categories-1 columns by specifying a category to drop for each feature.

• This helps avoid collinearity in the input matrix. This is useful, for example, when using linear regression, since collinearity can make the covariance matrix non-invertible.

• Use drop='if_binary' to drop a column only for features with exactly two categories.
• OneHotEncoder supports missing values by considering them an additional category (see the sketch after this list).
• If a feature contains both np.nan and None, they will be treated as separate categories.
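• A sketch pulling these options together on made-up data (missing-value support assumes a recent scikit-learn version; outputs are densified with toarray() for readability):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# handle_unknown='ignore': categories unseen during fit encode as all zeros
enc = OneHotEncoder(handle_unknown='ignore').fit([['male', 'from US'],
                                                  ['female', 'from Europe']])
print(enc.categories_)
print(enc.transform([['female', 'from Asia']]).toarray())

# drop='if_binary': drop one column only for features with two categories
enc_drop = OneHotEncoder(drop='if_binary').fit([['male', 'from US'],
                                                ['female', 'from Europe'],
                                                ['female', 'from Asia']])
print(enc_drop.transform([['male', 'from US']]).toarray())

# Missing values are treated as an additional category
enc_nan = OneHotEncoder().fit([['male'], ['female'], [np.nan]])
print(enc_nan.categories_)
```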

### Quantization, aka Binning

KBinsDiscretizer partitions features into $k$ bins.

• The output is, by default, one-hot encoded into a sparse matrix. This is controlled with encode.

• The bin edges are computed during fit and define the intervals (along with the number of bins.)

• Discretization is similar to constructing histograms for continuous data.
• histograms focus on counting features in particular bins.
• discretization assigns feature values to these bins.
• KBinsDiscretizer selects the binning method via the strategy parameter:

• ‘uniform’: uses constant-width bins.
• ‘quantile’: uses the quantiles values to have equally populated bins.
• ‘kmeans’: bins based on independent k-means clustering for each feature.
• You can also specify custom bins by passing a callable (for example, pandas.cut) to FunctionTransformer.
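• A compact sketch of the standard usage on made-up data (encode='ordinal' is used so the binned output is easy to read):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[-3., 5., 15.],
              [ 0., 6., 14.],
              [ 6., 3., 11.]])

# Three equal-width bins per feature, returned as ordinal bin indices
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform').fit(X)
print(est.bin_edges_)   # computed during fit
print(est.transform(X))
```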

### Example: Binning Continuous Features with KBinsDiscretizer

• Compare predictions of linear regression (LR) and decision tree (DT), with and without discretization of real-valued features.

• LRs are easy to build & interpret, but can only model linear relationships. DTs can build more complex models.

• Binning is one way to make LRs more powerful on continuous data. If the bins are not reasonably wide, there is an increased risk of overfitting, so the discretizer should be tuned with cross-validation.

• After binning, LR & DT make exactly the same prediction. As features are constant within each bin, any model must predict the same value for all points within a bin.

• After binning, the LR becomes much more flexible, while the DT becomes much less flexible. Binning features generally has no benefit for DTs - these models can learn to split up the data anywhere.
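• A rough sketch of the idea (not the full scikit-learn example): fit a linear regression on a single noisy non-linear feature, with and without one-hot binning:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# Plain linear regression cannot capture the sine shape
print(LinearRegression().fit(X, y).score(X, y))

# Binning + one-hot encoding lets the linear model fit a step function
binned_lr = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode='onehot', strategy='uniform'),
    LinearRegression())
print(binned_lr.fit(X, y).score(X, y))
```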

### Example: Feature discretization

• Feature discretization decomposes each feature into a set of equal-width bins. The binned values are then one-hot encoded and given to a linear classifier. This preprocessing enables modeling non-linear behavior even though the classifier is linear.

• The first two rows represent linearly non-separable datasets (moons and concentric circles); the third is approximately linearly separable.

• Feature discretization increases the linear classifier performance on the non-separable datasets, but decreases performance on the third. Two non-linear classifiers are also shown for comparison.

• This is not a great example - the intuition conveyed does not carry over to real datasets.

• High-dimensional data can more easily be separated linearly.
• Feature discretization and one-hot encoding increase the number of features, which can easily lead to overfitting when the number of samples is small.
• Plots: training points = solid colors; testing points = semi-transparent. The lower right shows the classification accuracy on the test set.