Ungraded lab: Permutation Feature Importance


Welcome! During this ungraded lab you are going to perform Permutation Feature Importance on the wine dataset using scikit-learn. In particular you will:

  1. Train a Random Forest classifier on the data.
  2. Compute the feature importance score by permuting each feature.
  3. Re-train the model with only the top features.
  4. Check other classifiers for comparison.

Let's get started!

Inspect and pre-process the data

Begin by upgrading scikit-learn to the latest version:
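
In a notebook, a cell along these lines should do it (assuming a pip-based environment):

    !pip install --upgrade scikit-learn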

Now import the required dependencies and load the dataset:
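
A minimal sketch of what this cell might contain (the exact import list is an assumption based on the steps used later in the lab):

    from sklearn.datasets import load_wine
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance

    # Load the wine dataset bundled with scikit-learn
    data = load_wine()
    X, y = data.data, data.target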

This dataset is made up of 13 numerical features and there are 3 different classes of wine.

Now perform the train/test split and normalize the data using StandardScaler:
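
Something like the following should work (the 80/20 split ratio and the random seed are assumptions):

    # Hold out 20% of the data for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Fit the scaler on the train set only, then apply it to both splits
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)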

Train the classifier

Now you will fit a Random Forest classifier with 10 estimators and compute the mean accuracy achieved:
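
A sketch of this step, assuming a fixed seed for reproducibility (without one the accuracy may vary slightly):

    # Random Forest with 10 trees
    rf_clf = RandomForestClassifier(n_estimators=10, random_state=42)
    rf_clf.fit(X_train, y_train)

    # score() returns the mean accuracy on the given data and labels
    print(f"Mean accuracy: {rf_clf.score(X_test, y_test):.2f}")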

This model achieved a mean accuracy of 91%. Pretty good for a model without any fine-tuning.

Permutation Feature Importance

To perform the model inspection technique known as Permutation Feature Importance you will use scikit-learn's built-in function permutation_importance.

You will create a function that given a classifier, features and labels computes the feature importance for every feature:
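
A minimal sketch of such a function is shown below. It relies on the globally loaded data for the feature names; the number of shuffles (n_repeats), the seed, and the optional top_limit parameter are assumptions about how the helper is set up:

    def feature_importance(clf, X, y, top_limit=None):
        # Shuffle each feature n_repeats times and measure the drop in score
        result = permutation_importance(clf, X, y, n_repeats=10, random_state=42)

        # Rank features from most to least important
        sorted_idx = result.importances_mean.argsort()[::-1]
        top_limit = len(sorted_idx) if top_limit is None else top_limit

        for i in sorted_idx[:top_limit]:
            print(f"{data.feature_names[i]:<30}"
                  f"{result.importances_mean[i]:.3f} "
                  f"+/- {result.importances_std[i]:.3f}")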

The importance score is computed such that higher values indicate greater predictive power. To see exactly how it is computed, check out this link.

Now use the feature_importance function on the Random Forest classifier and the train set:
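
Assuming the helper sketched above, the call looks like this:

    feature_importance(rf_clf, X_train, y_train)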

Looks like many of the features have a fairly low importance score. This suggests that the predictive power of this dataset is condensed in just a few features.

However, it is important to notice that this process was done on the training set, so this feature importance does NOT take into account whether a feature might help with the generalization power of the model.

To check this, repeat the process for the test set:
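
Same call, now on the test split:

    feature_importance(rf_clf, X_test, y_test)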

Notice that the top most important features are the same for both sets. However, a feature such as alcohol, which was considered unimportant on the training set, is much more important when using the test set. This hints that this feature contributes to the generalization power of the model.

If a feature is deemed important for the train set but not for the test set, it will probably cause the model to overfit.

Re-train the model with the most important features

Now you will re-train the Random Forest classifier with only the top 3 most important features.

In this case they are the same for both sets:
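
A sketch of the retraining step, assuming the top 3 features turned out to be flavanoids, proline and color_intensity (check your own output from the previous step):

    # Column indices of the (assumed) top 3 features
    top_features = [data.feature_names.index(f)
                    for f in ["flavanoids", "proline", "color_intensity"]]

    rf_top = RandomForestClassifier(n_estimators=10, random_state=42)
    rf_top.fit(X_train[:, top_features], y_train)
    print(f"Mean accuracy: {rf_top.score(X_test[:, top_features], y_test):.2f}")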

Notice that by using only the 3 most important features the model achieved a mean accuracy even higher than the one using all 13 features.

Remember that the alcohol feature was deemed unimportant on the train split, but you had the hypothesis that it carried important information for the generalization of the model.

Add this feature and see how the model performs:
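
Extending the previous sketch with the alcohol column:

    # Add the alcohol column to the (assumed) top 3 features
    top_plus_alcohol = top_features + [data.feature_names.index("alcohol")]

    rf_plus = RandomForestClassifier(n_estimators=10, random_state=42)
    rf_plus.fit(X_train[:, top_plus_alcohol], y_train)
    print(f"Mean accuracy: "
          f"{rf_plus.score(X_test[:, top_plus_alcohol], y_test):.2f}")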

Wow! By adding this additional feature you now get a mean accuracy of 100%! Quite remarkable! Looks like this feature did in fact provide some important information that helped the model do a better job at generalizing.

Try out other classifiers

The process of Permutation Feature Importance also depends on the classifier you are using. Since different classifiers follow different rules for classification, it is natural to assume they will consider different features to be important or unimportant.

To test this, try out other classifiers:
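
One way to do this is to loop over a few classifiers and reuse the feature_importance helper; the particular classifiers chosen here are an assumption:

    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    classifiers = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Support Vector Machine": SVC(),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
    }

    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        print(f"=== {name} ===")
        # Show only the 3 most important features for each classifier
        feature_importance(clf, X_test, y_test, top_limit=3)
        print()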

Looks like flavanoids and proline are very important across all classifiers. However, there is variability from one classifier to another in which features are considered the most important.


Congratulations on finishing this ungraded lab! Now you should have a clearer understanding of what Permutation Feature Importance is, why it is useful and how to implement this technique using scikit-learn.

Keep it up!