.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "generated/gallery/examples/plot_cfi.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_generated_gallery_examples_plot_cfi.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_generated_gallery_examples_plot_cfi.py:


Conditional Feature Importance (CFI) on the wine dataset
========================================================

This example demonstrates how to measure feature importance using CFI
[:footcite:t:`Chamma_NeurIPS2023`] on the wine dataset. The data are the results
of chemical analyses of wines grown in the same region in Italy, derived from
three different cultivars. Thirteen features are used to predict three types of
wine, making this a 3-class classification problem. In this example, we show how
to use CFI to identify which variables are most important for solving the
classification task with a neural network classifier.

.. GENERATED FROM PYTHON SOURCE LINES 14-21

Loading and preparing the data
------------------------------
We start by loading the dataset and splitting it into training and test sets.
This split is used both for training the classifier and for the CFI method.
CFI measures the importance of a feature by generating perturbations of that
feature, sampled from the conditional distribution :math:`p(X^j | X^{-j})`,
which is estimated on the training set.

.. GENERATED FROM PYTHON SOURCE LINES 21-36

.. code-block:: Python

    from sklearn.datasets import load_wine
    from sklearn.model_selection import train_test_split

    X, y = load_wine(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.5,
        random_state=0,
        stratify=y,
        shuffle=True,
    )

.. GENERATED FROM PYTHON SOURCE LINES 37-42

Fitting the model and computing CFI feature importance
------------------------------------------------------
To solve the classification task, we use a pipeline that first standardizes the
features with StandardScaler, followed by a neural network (MLPClassifier) with
one hidden layer of 100 neurons. Before measuring feature importance, we
evaluate the estimator's performance by reporting its accuracy score on the
test set.

.. GENERATED FROM PYTHON SOURCE LINES 42-59

.. code-block:: Python

    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    clf = make_pipeline(
        StandardScaler(),
        MLPClassifier(
            hidden_layer_sizes=(100,),
            random_state=0,
            max_iter=500,
        ),
    )
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"Accuracy: {clf.score(X_test, y_test):.3f}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Accuracy: 0.989

.. GENERATED FROM PYTHON SOURCE LINES 60-64

Next, we use the CFI class to measure feature importance. Here, we use a RidgeCV
model to estimate the conditional expectation :math:`\mathbb{E}[X^j | X^{-j}]`.
Since this is a classification task, we use log_loss as the loss and specify the
"predict_proba" method of our estimator.

.. GENERATED FROM PYTHON SOURCE LINES 64-85

.. code-block:: Python

    from sklearn.linear_model import RidgeCV
    from sklearn.metrics import log_loss

    from hidimstat import CFI

    cfi = CFI(
        estimator=clf,
        loss=log_loss,
        method="predict_proba",
        imputation_model_continuous=RidgeCV(),
        random_state=0,
    )
    cfi.fit(
        X_train,
        y_train,
        groups={feat_name: [i] for i, feat_name in enumerate(load_wine().feature_names)},
    )
    importances = cfi.importance(X_test, y_test)

.. GENERATED FROM PYTHON SOURCE LINES 86-89

Visualization of CFI feature importance
----------------------------------------
Finally, we visualize the importance of each feature using a bar plot.

.. GENERATED FROM PYTHON SOURCE LINES 89-98

.. code-block:: Python

    import matplotlib.pyplot as plt

    _, ax = plt.subplots(figsize=(6, 3))
    ax = cfi.plot_importance(ax=ax)
    ax.set_xlabel("Feature Importance")
    plt.tight_layout()
    plt.show()

.. image-sg:: /generated/gallery/examples/images/sphx_glr_plot_cfi_001.png
   :alt: plot cfi
   :srcset: /generated/gallery/examples/images/sphx_glr_plot_cfi_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 99-105

Variable importance analysis is meant to support scientific understanding, here
by identifying which features matter for differentiating the Barolo,
Grignolino, and Barbera wine types.

Note: Despite their very large marginal importance, the features 'flavanoids'
and 'total_phenols' are not selected by CFI, probably because of their high
correlation (0.86 between the two) and their redundancy with other features.

.. GENERATED FROM PYTHON SOURCE LINES 108-111

References
----------
.. footbibliography::

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 9.039 seconds)

.. _sphx_glr_download_generated_gallery_examples_plot_cfi.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_cfi.ipynb <plot_cfi.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_cfi.py <plot_cfi.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_cfi.zip <plot_cfi.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_