### Permutation Feature Importance (PFI)

• PFI can be used for any fitted estimator when the data is tabular. This is especially useful for non-linear or opaque estimators.

• It is defined to be the decrease in a model score when a single feature value is randomly shuffled. The procedure breaks the relationship between the feature and the target, thus a drop in the model score indicates how much the model depends on the feature.

• This technique is model agnostic and can be calculated many times with different permutations.

• permutation_importance calculates the feature importance of estimators for a given dataset. n_repeats controls the number of times a feature is randomly shuffled; the function returns a sample of feature importances.

• Validation performance ($R^2$ score) is significantly larger than chance. This enables using permutation_importance to learn which features are most predictive:
• Permutation importances can be computed on a training set or a held-out testing or validation set. Using a held-out set makes it possible to highlight which features contribute the most to the generalization power of the inspected model. Features that are important on the training set but not on the held-out set might cause the model to overfit.
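A minimal sketch of the workflow described above, using `permutation_importance` from scikit-learn's `inspection` module on a held-out set (the synthetic regression dataset here is just a stand-in):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 features, only 2 informative.
X, y = make_regression(n_samples=500, n_features=5, n_informative=2,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# n_repeats controls how many times each feature is shuffled;
# the result holds a sample of importances per feature.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
print(result.importances_mean)  # mean drop in R^2 per feature
print(result.importances_std)
```

Computing on `X_test` (rather than the training set) highlights which features matter for generalization, as the bullet above notes.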

### Tree-based models: Impurity-based vs Permutation-based Importance

• Tree-based models provide an alternative measure of feature importances based on the mean decrease in impurity (MDI).

• Impurity is quantified by the splitting criterion of the decision trees (Gini, Entropy or Mean Squared Error). However, this method can give high importance to features that may not be predictive on unseen data when the model is overfitting.

• Permutation-based feature importance avoids this issue, since it can be computed on unseen data.

• Furthermore, impurity-based feature importance for trees is strongly biased: it favors high-cardinality features (e.g. numerical features) over low-cardinality variables (binary or categorical features with a small number of possible categories).

• Permutation-based feature importances do not exhibit such a bias. Additionally, permutation feature importance may be computed with any performance metric on the model predictions, and can be used to analyze any model class (not just tree-based models).
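To illustrate the metric- and model-agnostic point, here is a hedged sketch passing a non-default `scoring` metric to `permutation_importance` with a non-tree model (the dataset is synthetic, chosen only for the example):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear model, to show the method is not limited to trees.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Any scorer works; importances are then drops in that metric.
result = permutation_importance(clf, X_test, y_test,
                                scoring="neg_log_loss",
                                n_repeats=10, random_state=0)
print(result.importances_mean)
```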

• The following example highlights the limitations of impurity-based feature importance in contrast to permutation-based feature importance: Permutation Importance vs Random Forest Feature Importance (MDI).

• The example shows how to apply separate preprocessing to different feature types. Two random variables that are not correlated with the target variable (survived) are added:

• random_num: high-cardinality (many unique values) numerical variable

• random_cat: low-cardinality (3 possible values) categorical variable
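A minimal sketch of how two such uninformative columns can be appended to a feature matrix (pandas is assumed; the stand-in `age` column and the 200-row size are hypothetical, not from the example):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
X = pd.DataFrame({"age": rng.uniform(1, 80, size=200)})  # stand-in data

# High-cardinality numerical variable: almost every value is unique.
X["random_num"] = rng.randn(len(X))
# Low-cardinality categorical variable: only 3 possible values.
X["random_cat"] = rng.randint(3, size=len(X))
```

Neither column carries information about the target, so a faithful importance measure should rank both near zero.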
• Is the model's predictive ability sufficient? (Otherwise, why bother?)
• Result: Random Forest has enough capacity to completely memorize the training set, and can still generalize reasonably well.

• You could trade training accuracy for test set accuracy by limiting tree capacity (setting min_samples_leaf to 5 or 10, for example). This could limit overfitting without introducing too much underfitting.
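A sketch of that capacity trade-off on synthetic data (the dataset and scores are illustrative, not the example's Titanic results):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unconstrained forest: typically memorizes the training set.
full = RandomForestClassifier(random_state=0).fit(X_train, y_train)
# Constrained forest: each leaf must cover at least 10 samples.
constrained = RandomForestClassifier(min_samples_leaf=10,
                                     random_state=0).fit(X_train, y_train)

print(full.score(X_train, y_train), full.score(X_test, y_test))
print(constrained.score(X_train, y_train),
      constrained.score(X_test, y_test))
```

The gap between training and test accuracy should shrink for the constrained forest.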

• Impurity-based feature importance ranks numerical variables as most important - so random_num gets top billing.

• This is because:

• Impurity-based importance is biased towards high-cardinality features.
• Permutation importance, however, is computed on a reserved test set. In this case sex (a low-cardinality feature) is deemed most important.

• Also notice the two random features have very low scores.

• You can also compute permutation importances on the training set.

• This shows random_num gets a higher importance ranking. The difference is a confirmation that the RF model has enough capacity to use that random numerical feature to overfit.

• You can further confirm this by re-running this example with a constrained RF with min_samples_leaf=10.
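The train-vs-test comparison described above can be sketched as follows (a random numerical column is appended as a stand-in for random_num; the data is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
rng = np.random.RandomState(0)
X = np.hstack([X, rng.randn(len(X), 1)])  # last column plays random_num

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

pi_train = permutation_importance(rf, X_train, y_train,
                                  n_repeats=10, random_state=0)
pi_test = permutation_importance(rf, X_test, y_test,
                                 n_repeats=10, random_state=0)

# The random column tends to look more important on the training set,
# signalling that the forest used it to overfit.
print(pi_train.importances_mean[-1], pi_test.importances_mean[-1])
```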

### Example: Permutation Importance - Correlated Features

• When two features are correlated and one of them is permuted, the model still has access to the information through the correlated feature. This can result in lower importance values for both features, even though they might actually be important.

• One way to handle this is to cluster correlated features, and only keep one feature from each cluster.

• The Wisconsin breast cancer data set contains multicollinear features - so permutation importance says none of the features are important.
• A PI plot suggests none of the features are important, which contradicts the high test accuracy (something here must be important).
• When features are collinear, permuting one feature will have little effect on the model's performance because it can get the same information from a correlated feature.

• One way to handle this is to apply hierarchical clustering on the Spearman rank-order correlations, picking a threshold, and keeping a single feature from each cluster. First, we plot a heatmap of the correlated features.
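The clustering step above can be sketched like this, assuming SciPy for the Spearman correlations and Ward linkage (the breast cancer dataset matches the example described earlier):

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Pairwise Spearman rank-order correlations between the 30 features.
corr = spearmanr(X).correlation
corr = (corr + corr.T) / 2  # force exact symmetry
np.fill_diagonal(corr, 1)

# Turn correlations into distances and link hierarchically.
dist = 1 - np.abs(corr)
linkage = hierarchy.ward(squareform(dist))
# hierarchy.dendrogram(linkage) would plot the tree used to pick
# a threshold, alongside a heatmap of `corr`.
```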

• Pick a threshold from a visual inspection of the dendrogram. Group features into clusters, choose a feature from each, select those features from the dataset, and train a new random forest.

• The test accuracy on the new dataset shouldn't change much.
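The whole procedure, from clustering to retraining, can be sketched as below; the threshold t=1 is a hypothetical choice that would normally be read off the dendrogram:

```python
from collections import defaultdict

import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hierarchical clustering on Spearman rank-order correlations.
corr = spearmanr(X).correlation
corr = (corr + corr.T) / 2
np.fill_diagonal(corr, 1)
linkage = hierarchy.ward(squareform(1 - np.abs(corr)))

# Cut the tree at a (hypothetical) threshold, keep one feature
# per cluster.
cluster_ids = hierarchy.fcluster(linkage, t=1, criterion="distance")
clusters = defaultdict(list)
for idx, cid in enumerate(cluster_ids):
    clusters[cid].append(idx)
selected = [members[0] for members in clusters.values()]

# Retrain on the reduced feature set; accuracy should stay comparable.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_train[:, selected],
                                                y_train)
print(len(selected), rf.score(X_test[:, selected], y_test))
```

With the multicollinearity removed, permutation importance computed on this reduced model no longer spreads the signal across redundant columns.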