.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "generated/gallery/examples/plot_cfi.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_generated_gallery_examples_plot_cfi.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_generated_gallery_examples_plot_cfi.py:


Conditional Feature Importance (CFI) on the wine dataset
========================================================

This example demonstrates how to measure feature importance using CFI [:footcite:t:`Chamma_NeurIPS2023`] on the wine dataset.
The data are the results of chemical analyses of wines grown in the same region in Italy,
derived from three different cultivars. Thirteen features are used to predict three types
of wine, making this a 3-class classification problem. In this example, we show how to
use CFI to identify which variables are most important for solving the classification
task with a neural network classifier.

.. GENERATED FROM PYTHON SOURCE LINES 14-21

Loading and preparing the data
------------------------------
We start by loading the dataset and splitting it into training and test sets.
The training set is used to fit both the classifier and the CFI method, and the
test set is used to compute the accuracy and the importance scores.
The CFI method measures the importance of a feature by generating perturbations
through sampling from the conditional distribution :math:`p(X^j | X^{-j})`,
which is estimated on the training set.
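
The following is a minimal sketch of one way to sample approximately from such a
conditional distribution: predict the feature from the remaining ones, then shuffle
the residuals and add them back to the prediction. The toy data, the ordinary least
squares fit, and all variable names are assumptions made for illustration; they are
not taken from the hidimstat implementation, and for brevity the sketch fits and
perturbs on the same data rather than on separate training and test sets.

.. code-block:: Python

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: column 1 is strongly correlated with column 0, column 2 is independent.
    n_samples = 200
    x0 = rng.normal(size=n_samples)
    x1 = 0.9 * x0 + 0.1 * rng.normal(size=n_samples)
    x2 = rng.normal(size=n_samples)
    X = np.column_stack([x0, x1, x2])

    j = 1  # index of the feature to perturb
    X_minus_j = np.delete(X, j, axis=1)

    # Estimate E[X^j | X^{-j}] with ordinary least squares (intercept included).
    design = np.column_stack([np.ones(n_samples), X_minus_j])
    coef, *_ = np.linalg.lstsq(design, X[:, j], rcond=None)
    x_j_hat = design @ coef

    # Shuffle the residuals and add them back to the conditional mean.
    residuals = X[:, j] - x_j_hat
    X_perturbed = X.copy()
    X_perturbed[:, j] = x_j_hat + rng.permutation(residuals)

    # The perturbed feature keeps its dependence on the other features.
    print(np.corrcoef(X_perturbed[:, j], X[:, 0])[0, 1])

Unlike a plain shuffle of the column, this perturbation preserves the relationship
between the feature and the others, so only the information unique to the feature
is destroyed.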

.. GENERATED FROM PYTHON SOURCE LINES 21-36

.. code-block:: Python


    from sklearn.datasets import load_wine
    from sklearn.model_selection import train_test_split

    X, y = load_wine(return_X_y=True)

    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.5,
        random_state=0,
        stratify=y,
        shuffle=True,
    )


.. GENERATED FROM PYTHON SOURCE LINES 37-42

Fitting the model and computing CFI feature importance
------------------------------------------------------
To solve the classification task, we use a pipeline that first standardizes the features
with ``StandardScaler`` and then fits a neural network (``MLPClassifier``) with one hidden
layer of 100 neurons. Before measuring feature importance, we check the estimator's
performance by reporting its accuracy on the test set.

.. GENERATED FROM PYTHON SOURCE LINES 42-59

.. code-block:: Python


    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    clf = make_pipeline(
        StandardScaler(),
        MLPClassifier(
            hidden_layer_sizes=(100,),
            random_state=0,
            max_iter=500,
        ),
    )

    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(f"Accuracy: {clf.score(X_test, y_test):.3f}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Accuracy: 0.989


.. GENERATED FROM PYTHON SOURCE LINES 60-64

Next, we use the ``CFI`` class to measure feature importance. Here, we use a ``RidgeCV``
model to estimate the conditional expectation :math:`\mathbb{E}[X^j | X^{-j}]`.
Since this is a classification task, we use ``log_loss`` as the loss and set
``method="predict_proba"`` so that the loss is computed on the predicted class
probabilities of our estimator rather than on hard labels.
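
To illustrate why a probability-based loss is paired with predicted probabilities,
the sketch below computes the multiclass log loss by hand on a few made-up
predictions. The arrays are hypothetical values chosen for illustration, not
outputs of the classifier in this example.

.. code-block:: Python

    import numpy as np

    # Hypothetical predicted probabilities for 4 samples and 3 classes.
    y_true = np.array([0, 1, 2, 1])
    proba = np.array(
        [
            [0.8, 0.1, 0.1],
            [0.2, 0.7, 0.1],
            [0.1, 0.2, 0.7],
            [0.3, 0.4, 0.3],
        ]
    )

    # Log loss: mean negative log-probability assigned to the true class.
    p_true = proba[np.arange(len(y_true)), y_true]
    loss = -np.mean(np.log(p_true))

    # Hard labels discard confidence: the uncertain last sample still counts
    # as fully correct once the probabilities are reduced to an argmax.
    hard_accuracy = np.mean(proba.argmax(axis=1) == y_true)

    print(f"log loss: {loss:.3f}, accuracy: {hard_accuracy:.2f}")

Because the log loss reacts to changes in confidence, it can register the effect
of a perturbed feature even when the predicted label does not change.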

.. GENERATED FROM PYTHON SOURCE LINES 64-85

.. code-block:: Python


    from sklearn.linear_model import RidgeCV
    from sklearn.metrics import log_loss

    from hidimstat import CFI

    cfi = CFI(
        estimator=clf,
        loss=log_loss,
        method="predict_proba",
        imputation_model_continuous=RidgeCV(),
        random_state=0,
    )
    cfi.fit(
        X_train,
        y_train,
        groups={feat_name: [i] for i, feat_name in enumerate(load_wine().feature_names)},
    )
    importances = cfi.importance(X_test, y_test)
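
Conceptually, the importance of a feature can be read as the increase in loss when
the model predicts from the perturbed feature instead of the original one. The
sketch below illustrates this idea on toy regression data. The ``predict`` and
``squared_error`` functions and the data are hypothetical, and the exact
computation in hidimstat may differ (for instance, by averaging over several
perturbations). The two toy features are independent by construction, so a plain
shuffle is a valid draw from the conditional distribution here.

.. code-block:: Python

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy regression data with two independent features; only the first drives y.
    X = rng.normal(size=(500, 2))
    y = 3.0 * X[:, 0]


    def predict(X):
        """Hypothetical fitted model: uses feature 0 and ignores feature 1."""
        return 3.0 * X[:, 0]


    def squared_error(y_true, y_pred):
        """Mean squared error between targets and predictions."""
        return np.mean((y_true - y_pred) ** 2)


    baseline = squared_error(y, predict(X))

    importance = {}
    for j in range(X.shape[1]):
        X_perturbed = X.copy()
        # Independence makes a plain shuffle a valid draw from p(X^j | X^{-j}).
        X_perturbed[:, j] = rng.permutation(X[:, j])
        # Importance is the loss increase caused by perturbing feature j.
        importance[j] = squared_error(y, predict(X_perturbed)) - baseline

    print(importance)

The feature the model relies on receives a positive score, while the ignored
feature receives a score of exactly zero because perturbing it leaves the
predictions unchanged.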


.. GENERATED FROM PYTHON SOURCE LINES 86-89

Visualization of CFI feature importance
----------------------------------------
Finally, we visualize the importance of each feature using a bar plot.

.. GENERATED FROM PYTHON SOURCE LINES 89-98

.. code-block:: Python


    import matplotlib.pyplot as plt

    _, ax = plt.subplots(figsize=(6, 3))
    ax = cfi.plot_importance(ax=ax)
    ax.set_xlabel("Feature Importance")
    plt.tight_layout()
    plt.show()


.. image-sg:: /generated/gallery/examples/images/sphx_glr_plot_cfi_001.png
   :alt: plot cfi
   :srcset: /generated/gallery/examples/images/sphx_glr_plot_cfi_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 99-105

Variable importance analysis is meant to help scientific understanding, in particular
to identify which features are important to differentiate Barolo, Grignolino, and
Barbera wine types.
Note: Despite their very large marginal importance, the features 'flavanoids' and
'total_phenols' are not selected by CFI, probably because of their high correlation
(0.86 between the two) and their redundancy with other features.
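
The correlation quoted above can be checked directly on the data. The snippet
below is a small sketch that assumes the Pearson correlation computed on the full
dataset is the quantity being referred to.

.. code-block:: Python

    import numpy as np
    from sklearn.datasets import load_wine

    wine = load_wine()
    names = list(wine.feature_names)

    # Column indices of the two features discussed above.
    i = names.index("total_phenols")
    j = names.index("flavanoids")

    # Pearson correlation between the two columns over all samples.
    corr = np.corrcoef(wine.data[:, i], wine.data[:, j])[0, 1]
    print(f"Pearson correlation: {corr:.2f}")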

.. GENERATED FROM PYTHON SOURCE LINES 108-111

References
----------
.. footbibliography::


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 9.039 seconds)


.. _sphx_glr_download_generated_gallery_examples_plot_cfi.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_cfi.ipynb <plot_cfi.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_cfi.py <plot_cfi.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_cfi.zip <plot_cfi.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_