### Imputation of Missing Values¶

• Many real-world datasets contain missing values, often encoded as blanks, NaNs, or other placeholders. These are incompatible with scikit-learn estimators, which assume all values in an array are numerical and meaningful.

• You could discard rows or columns containing missing values, but that comes at the price of losing possibly valuable data. A better strategy is to impute the missing values, i.e., infer them from the known part of the data.

### Univariate feature imputation¶

• SimpleImputer performs univariate imputation: it imputes values in a given feature dimension using only the non-missing values in that same dimension.

• Below: replace np.nan with the mean of each column (axis 0).

• Sparse matrices are supported.
• Categorical data (strings or pandas categoricals) is supported with the most_frequent or constant strategies.
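A minimal sketch of the mean-imputation note above (the toy data here is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Learn per-column means (axis 0) from the training data,
# then replace np.nan with those means at transform time.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit([[1, 2], [np.nan, 3], [7, 6]])  # column means: 4 and 11/3

X = [[np.nan, 2], [6, np.nan], [7, 6]]
X_imputed = imp.transform(X)
print(X_imputed)
```

Missing entries in the training data are simply ignored when computing each column's mean.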

### Multivariate Feature Imputation¶

• IterativeImputer models each feature's missing values as a function of the other features, cycling through the features in round-robin fashion for up to max_iter imputation rounds.

• This estimator is still experimental: it must be explicitly enabled by importing enable_iterative_imputer from sklearn.experimental.
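A short sketch of the experimental-import requirement and basic usage (the toy data is illustrative):

```python
import numpy as np
# IterativeImputer is experimental: this import explicitly enables it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Fit on data where the second feature is roughly twice the first,
# so the round-robin regression can learn that relationship.
imp = IterativeImputer(max_iter=10, random_state=0)
imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])

X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
X_out = imp.transform(X_test)
print(np.round(X_out))
```

Each missing entry is filled in by regressing that feature on the others, so the imputed values respect relationships between columns rather than a single column's statistics.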

### Example: Iterative Imputing Variations¶

• Goal: compare estimators to see which performs best on the California Housing dataset, with a single value randomly removed from each row.

• Estimator options:

• BayesianRidge (regularized linear regression)
• DecisionTreeRegressor (non-linear regression)
• ExtraTreesRegressor (similar to missForest in R)
• KNeighborsRegressor (comparable to other KNN imputation approaches)
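A sketch of the comparison, using small random data in place of California Housing; the hyperparameters and the MSE-against-ground-truth scoring here are illustrative, not the exact setup of the scikit-learn example:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_full = rng.rand(100, 4)  # toy stand-in for California Housing

# Remove a single value at random from each row.
X_missing = X_full.copy()
for i in range(X_full.shape[0]):
    X_missing[i, rng.randint(X_full.shape[1])] = np.nan

estimators = {
    "BayesianRidge": BayesianRidge(),
    "DecisionTree": DecisionTreeRegressor(max_depth=3, random_state=0),
    "ExtraTrees": ExtraTreesRegressor(n_estimators=10, random_state=0),
    "KNeighbors": KNeighborsRegressor(n_neighbors=5),
}
for name, est in estimators.items():
    imp = IterativeImputer(estimator=est, max_iter=5, random_state=0)
    X_imp = imp.fit_transform(X_missing)
    mse = np.mean((X_imp - X_full) ** 2)  # error vs. the known ground truth
    print(f"{name}: imputation MSE = {mse:.4f}")
```

Swapping the `estimator` parameter is all that changes between variants; everything else in the round-robin procedure stays the same.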

### Multiple vs Single Imputation¶

• A best practice is to generate $m$ separate imputations for a single feature matrix and put each imputation through the full analysis pipeline. The $m$ analysis results show how much conclusions vary due to the uncertainty caused by the missing values. This is called multiple imputation.

• IterativeImputer was inspired by the R MICE package (Multivariate Imputation by Chained Equations), but differs in that it returns a single imputation. It can still be used for multiple imputation by applying it repeatedly to the same dataset with different random seeds when sample_posterior=True.

• Note: IterativeImputer's transform method is not allowed to change the number of samples, so multiple imputations cannot be achieved by a single call to transform.
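The repeated-application idea can be sketched as follows (the toy data and the choice of $m=5$ draws are illustrative):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, 6.0], [4.0, 8.0],
              [np.nan, 3.0], [7.0, np.nan]])

# m = 5 draws: refit with sample_posterior=True and a different seed
# each time, so each completed dataset is a distinct posterior sample.
imputations = [
    IterativeImputer(sample_posterior=True,
                     random_state=seed).fit_transform(X)
    for seed in range(5)
]
# Run the analysis pipeline on each completed dataset and pool the
# results to quantify variability due to the missing values.
print(len(imputations), imputations[0].shape)
```

sample_posterior=True requires an estimator whose predict supports return_std; the default BayesianRidge does.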

### Nearest Neighbors Imputation¶

• KNNImputer finds the nearest neighbors using nan_euclidean_distances (a Euclidean distance metric that supports missing values). Each missing feature is imputed using values from the n_neighbors nearest neighbors that have a value for that feature.

• The neighbors' feature values are averaged uniformly or weighted by distance to each neighbor.

• If a sample has more than one missing feature, the neighbors for that sample can be different depending on the feature being imputed. When the number of available neighbors is less than n_neighbors and there are no defined distances to the training set, the training set average for that feature is used during imputation.

• If there is at least one neighbor with a defined distance, the weighted or unweighted average of the remaining neighbors will be used during imputation. If a feature is always missing in training, it is removed during transform.
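A minimal sketch of KNN imputation (the toy data is illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = [[1, 2, np.nan],
     [3, 4, 3],
     [np.nan, 6, 5],
     [8, 8, 7]]

# Uniform average over the 2 nearest neighbors that have a value
# present for the feature being imputed.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Row 0's missing third feature is filled with the mean of its two closest rows that have that feature (rows 1 and 2), illustrating that neighbor selection is per-feature.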

### Marking Imputed Values¶

• MissingIndicator transforms a dataset into a binary matrix which indicates the presence of missing values.

• SimpleImputer and IterativeImputer have an add_indicator option (False by default) which, when True, stacks the missing-indicator matrix onto the imputer's output.

• np.nan is the usual placeholder value, but the missing_values parameter accepts other values, such as an integer.

• The features parameter chooses which features are used to build the mask; "missing-only" (only features that contained missing values during fit) is the default.
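A minimal sketch of MissingIndicator with the default features="missing-only" (the toy data is illustrative):

```python
import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[np.nan, 1.0, 3.0],
              [4.0, 0.0, np.nan],
              [8.0, 1.0, 0.0]])

indicator = MissingIndicator()  # features="missing-only" by default
mask = indicator.fit_transform(X)
print(indicator.features_)  # indices of features that had missing values
print(mask)                 # boolean mask, one column per such feature
```

Only columns 0 and 2 contained missing values, so the mask has two columns; features="all" would keep one indicator column per input feature instead.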
• When using MissingIndicator in a pipeline, be sure to use FeatureUnion or ColumnTransformer to add the indicators to the regular features.

• Create a FeatureUnion and add the indicators from MissingIndicator.
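One way to sketch this (the data, classifier choice, and labels are illustrative):

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.tree import DecisionTreeClassifier

X = np.array([[np.nan, 1.0],
              [2.0, np.nan],
              [3.0, 4.0],
              [np.nan, 5.0]])
y = np.array([0, 1, 1, 0])

# Stack the imputed features side by side with the indicator mask,
# so the downstream model sees both the values and the missingness.
transformer = FeatureUnion([
    ("features", SimpleImputer(strategy="mean")),
    ("indicators", MissingIndicator()),
])
clf = make_pipeline(transformer, DecisionTreeClassifier(random_state=0))
clf.fit(X, y)
print(clf.predict(X))
```

For SimpleImputer and IterativeImputer, setting add_indicator=True achieves the same stacking without a FeatureUnion.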