### Pipelines & Composite Estimators

• Transformers are often combined with classifiers, regressors or other estimators to build a composite estimator.

• Pipeline, the most common tool, is often combined with FeatureUnion, which concatenates transformer outputs into a composite feature space.

• TransformedTargetRegressor deals with transforming the target (e.g. log-transforming y).

### Pipelines

• Pipelines chain multiple estimators into one. This enables you to:

• Only call fit & predict once on your dataset.
• Use grid search over the parameters of all the estimators in the chain at once.
• Help prevent test data from leaking into a trained model during cross-validation.
• All elements in the pipeline must be transformers, except the last element (which may be any type - transformer, classifier, etc.)

• Built using (key,value) pairs. key is a step name (string); value is an estimator object.
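A minimal sketch of both construction styles (the step names here are illustrative):

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Explicit (key, value) pairs: the key names the step, the value is an estimator.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
print([name for name, _ in pipe.steps])   # ['scaler', 'svc']

# make_pipeline generates step names automatically from the class names.
pipe2 = make_pipeline(StandardScaler(), SVC())
print([name for name, _ in pipe2.steps])  # ['standardscaler', 'svc']
```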

• When you call fit on a pipeline, each intermediate estimator is fitted and then used to transform the data, which is passed on to the next step; the final estimator is only fitted.

• Access the list of steps via steps.
• Access a specific step by index, e.g. pipe[1].
• Use named_steps to access a step by name.
• Use Python slicing to access sub-pipelines.
• Use the <estimator>__<parameter> syntax (double underscore) to access params.
• This is key for doing grid searches.
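The access patterns above, sketched with an illustrative three-step pipeline:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()),
                 ("reduce_dim", PCA(n_components=2)),
                 ("clf", LogisticRegression())])

pipe.steps                          # full list of (name, estimator) pairs
step = pipe[1]                      # index access: the PCA step
same = pipe.named_steps.reduce_dim  # access by name (attribute or dict style)
sub = pipe[:2]                      # Python slicing returns a sub-pipeline
pipe.set_params(clf__C=10)          # <estimator>__<parameter> syntax
```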

### Caching

• Fitting transformers can be computationally expensive. With its memory parameter set, Pipeline caches each transformer after calling fit, avoiding a refit when the parameters and input data are identical.

• A typical example is a grid search where the transformers can be fitted only once and reused for each configuration.

• The memory param caches the transformers. It can be a path to a caching directory (string) or a joblib.Memory object.
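A small sketch of the caching behavior, using a temporary directory as the cache location:

```python
import tempfile
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, n_features=8, random_state=0)

# memory may be a directory path (string) or a joblib.Memory object.
cachedir = tempfile.mkdtemp()
pipe = Pipeline([("pca", PCA(n_components=2)), ("svc", SVC())], memory=cachedir)
pipe.fit(X, y)  # the fitted PCA is written to the cache
pipe.fit(X, y)  # identical params and data: the cached PCA fit is reused
print(pipe.predict(X).shape)
```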

### Example: ANOVA-SVM Pipeline

• Simple usage of Pipeline that successively runs univariate feature selection with ANOVA, then an SVM on the selected features.

• Using a sub-pipeline, the fitted coefficients can be mapped back into the original feature space.
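A minimal sketch of this idea on synthetic data (k=5 and the dataset sizes are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# Univariate (ANOVA F-test) feature selection, then a linear SVM.
anova_svm = make_pipeline(SelectKBest(f_classif, k=5), LinearSVC())
anova_svm.fit(X, y)

# Map the 5 fitted coefficients back into the original 20-feature space
# using the sub-pipeline of every step but the last.
coef_full = anova_svm[:-1].inverse_transform(anova_svm[-1].coef_)
print(coef_full.shape)  # (1, 20)
```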

### Example: PCA-Logistic Regression Pipeline

• PCA for unsupervised dimensionality reduction; logistic regression does the prediction. Use GridSearchCV to set the dimensionality of the PCA.
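A sketch of tuning the PCA dimensionality inside the pipeline (the candidate values are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
pipe = Pipeline([("pca", PCA()),
                 ("logreg", LogisticRegression(max_iter=2000))])

# <step>__<parameter> names reach into the pipeline from the grid search.
param_grid = {"pca__n_components": [5, 15, 30]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_["pca__n_components"])
```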

### Example: Feature map approximation for RBF kernels

• Shows how to use RBFSampler and Nystroem to approximate the feature map of an RBF kernel for classification with an SVM.

• Compares results between a linear SVM (original space), linear SVM (approx mapping), and a kernelized SVM.
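A condensed sketch of that comparison (gamma=0.2 and the default number of components are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.kernel_approximation import Nystroem, RBFSampler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values into [0, 1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Linear SVMs on two approximate RBF feature maps, vs. the exact kernel.
rbf_approx = make_pipeline(RBFSampler(gamma=0.2, random_state=0), LinearSVC())
nys_approx = make_pipeline(Nystroem(gamma=0.2, random_state=0), LinearSVC())
exact = SVC(kernel="rbf", gamma=0.2)

for model in (rbf_approx, nys_approx, exact):
    model.fit(X_tr, y_tr)
    print(round(model.score(X_te, y_te), 3))
```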

### Regression Target Transformer

• Transforms targets y before fitting a regression model. The predictions are mapped back to the original space via an inverse transform. It takes as arguments the regressor that will be used for prediction and the transformer that will be applied to the target variable.
• If using simple transformations, you can pass a pair of functions instead. They define the transform and its inverse mapping.
• The functions are verified to be the inverse of each other during each fit. check_inverse=False bypasses this check.
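A sketch of the function-pair form on hypothetical exponential targets:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 1)) * 10
y = np.exp(0.5 * X.ravel() + rng.normal(scale=0.05, size=100))

# func/inverse_func define the transform and its inverse mapping;
# check_inverse (default True) verifies they invert each other during fit.
reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
reg.fit(X, y)
print(round(reg.score(X, y), 3))
```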

### Example: Target Transforms prior to Regression Learning

• Uses two examples (one with synthetic data, another with the Ames housing data).
• Generate a synthetic random regression dataset.
• Translate the targets (y) so they are all non-negative.
• Apply an exponential function to make the targets non-linear.
• Plot the probability density functions before & after applying the log function.
• Fit a linear model (Ridge regression) to the original targets. The fit should not be accurate due to the non-linearity.
• Apply a log function to linearize the targets, allowing better prediction results.
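The steps above can be sketched as follows (sample size, noise level, and scaling constants are illustrative stand-ins for the example's exact settings):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, noise=100, random_state=0)
y = np.expm1((y + abs(y.min())) * 1e-2)  # shift non-negative, then exponentiate
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

plain = Ridge().fit(X_tr, y_tr)  # fit on the raw, non-linear targets
logged = TransformedTargetRegressor(regressor=Ridge(),
                                    func=np.log1p,
                                    inverse_func=np.expm1).fit(X_tr, y_tr)
# Linearizing the targets should improve the fit.
print(round(plain.score(X_te, y_te), 2), round(logged.score(X_te, y_te), 2))
```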

### Example #2: Ames housing data

• Target variable: house sales price.
• Use quantile transformer to normalize targets.
• Apply a Ridge regression (CV'd) model.
• Transformer effect is weaker this time - but it results in higher $R^2$ and decreased MAE.
• Residual plot without transformation returns "reverse smile" shape - residual values varying based on predicted target value.
• Residual plot with transformation is more linear - indicating better fit.
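The transformer-based form used here can be sketched with synthetic right-skewed targets standing in for the sale prices:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 5))
y = np.exp(X[:, 0] + 0.1 * rng.normal(size=300))  # right-skewed target

# Normalize the target with a quantile transform; predictions are mapped
# back through the inverse transform automatically.
reg = TransformedTargetRegressor(
    regressor=RidgeCV(),
    transformer=QuantileTransformer(n_quantiles=100,
                                    output_distribution="normal"))
reg.fit(X, y)
print(round(reg.score(X, y), 2))
```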

### Feature Unions

• Combines a list of transformers into a single object.
• Each transformer is fitted independently.
• Transforms are applied in parallel. Resulting feature matrices are merged side-by-side.
• A FeatureUnion has no way of checking whether two transformers produce identical features; it only yields a true union when the feature sets are disjoint, and ensuring that is the caller's responsibility.
• Built with a list of (key,value) pairs: key = arbitrary name string; value = estimator object.
• Shorthand notation: make_union. (Doesn't require explicit component names.)
• Individual steps can be replaced using set_params, and ignored by setting them to 'drop'.
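A short sketch combining two decompositions side-by-side (the step names are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA, PCA
from sklearn.pipeline import FeatureUnion, make_union

X, _ = load_iris(return_X_y=True)
union = FeatureUnion([("linear_pca", PCA(n_components=2)),
                      ("kernel_pca", KernelPCA(n_components=2))])
print(union.fit_transform(X).shape)  # (150, 4): 2 + 2 columns merged

# make_union builds the same object without explicit names.
auto = make_union(PCA(n_components=2), KernelPCA(n_components=2))

# Drop a step by name via set_params.
union.set_params(kernel_pca="drop")
print(union.fit_transform(X).shape)  # (150, 2)
```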

### Column Transformer

• Datasets usually contain multiple feature types (floats, integers, text, dates, ...) which require individual preprocessing or feature extraction steps.

• The usual method is to do this preprocessing with pandas before fitting, which can be problematic (test-data statistics leaking into cross-validation, preprocessing parameters that cannot be included in a parameter search, ...).

• ColumnTransformer enables using per-column transforms in a leakage-safe Pipeline. It works on arrays, sparse matrices & pandas DataFrames.

### Simple Example

• Encode cities as a category with One-Hot Encoder
• Apply a Count Vectorizer to titles
• Remaining columns can be ignored (remainder="drop").
• Give each (transformer, columns) step a descriptive combined name (ex: city_category); the same column can appear in several steps to apply multiple extraction methods to it.
• Above: CountVectorizer expects a 1D array, hence the column was specified as a string (title). OneHotEncoder expects 2D data, so you need to specify the column as a list of strings (['city']).

• Columns can be specified as a list of names, an integer array, a slice, a boolean mask, or with make_column_selector.

• Keep the remaining columns with remainder='passthrough'. The values are appended to the end of the result.
• remainder can point to an estimator to transform the remaining columns.
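The simple example above, sketched with a hypothetical toy DataFrame:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({"city": ["London", "London", "Paris"],
                  "title": ["His Last Bow", "A Study in Scarlet",
                            "Le Petit Prince"],
                  "expert_rating": [5, 3, 4]})

ct = ColumnTransformer(
    [("city_category", OneHotEncoder(), ["city"]),  # 2D input: list of columns
     ("title_bow", CountVectorizer(), "title")],     # 1D input: a single string
    remainder="drop")                                # ignore remaining columns
print(ct.fit_transform(X).shape)

ct_keep = ColumnTransformer(
    [("city_category", OneHotEncoder(), ["city"]),
     ("title_bow", CountVectorizer(), "title")],
    remainder="passthrough")  # expert_rating is appended at the end
print(ct_keep.fit_transform(X).shape)  # one extra column
```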

### Example Column Transformer

• Numeric data is mean-imputed, then standard-scaled.
• Category data: missing data is replaced with 'missing', then one-hot encoded.
• Two column dispatch mechanisms are shown: by column names, by column data types.
• The steps are integrated into a Pipeline with a simple classifier.
• When dealing with a cleaned dataset, preprocessing can be made automatic: make_column_selector uses each column's data type to decide whether to treat it as a numerical or a categorical feature.
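Dtype-based dispatch can be sketched with a small hypothetical DataFrame in place of the Titanic data:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({"age": [20.0, 30.0, 40.0],
                  "fare": [7.0, 70.0, 30.0],
                  "sex": pd.Series(["m", "f", "f"], dtype="category")})

ct = ColumnTransformer([
    ("num", StandardScaler(), make_column_selector(dtype_include=np.number)),
    ("cat", OneHotEncoder(), make_column_selector(dtype_include="category")),
])
print(ct.fit_transform(X).shape)  # (3, 4): 2 scaled + 2 one-hot columns
```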

• First, let’s only select a subset of columns to simplify our example.

• embarked and sex are tagged as categories when loading the data with fetch_openml. We can use this information to dispatch the categorical columns to the categorical_transformer and the remaining columns to the numerical_transformer.
• This score doesn't match the previous pipeline's score because the dtype-based selector treats the pclass column as a numeric feature instead of a category.
• Grid search can be used on the steps defined in the ColumnTransformer, together with the classifier’s hyperparameters as part of the Pipeline.

• Search for the best imputer strategy (for numeric preprocessing) and regularization parameter (for logistic regression) using GridSearchCV.

• The best hyperparameters must then be used to refit a final model on the full training set. That final model can be evaluated on held-out test data that was not used for hyperparameter tuning.
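A sketch of searching preprocessing and classifier parameters together; toy random data stands in for the Titanic dataset, and the grid values are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = pd.DataFrame({"age": rng.normal(40, 10, 60),
                  "fare": rng.uniform(5, 100, 60)})
y = (X["fare"] > 50).astype(int)

pre = ColumnTransformer([("num",
                          Pipeline([("impute", SimpleImputer()),
                                    ("scale", StandardScaler())]),
                          ["age", "fare"])])
clf = Pipeline([("preprocessor", pre), ("classifier", LogisticRegression())])

# Nested names reach through the Pipeline into the ColumnTransformer.
param_grid = {
    "preprocessor__num__impute__strategy": ["mean", "median"],
    "classifier__C": [0.1, 1.0, 10],
}
search = GridSearchCV(clf, param_grid, cv=3).fit(X, y)
print(search.best_params_)
```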