.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "generated/gallery/examples/plot_model_agnostic_importance.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_generated_gallery_examples_plot_model_agnostic_importance.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_generated_gallery_examples_plot_model_agnostic_importance.py:


Variable Selection Under Model Misspecification
=============================================================

In this example, we illustrate the limitations of variable selection methods based on
linear models using the circles dataset. We first use the distilled conditional
randomization test (d0CRT), which is based on linear models :footcite:t:`liu2022fast`, and then
demonstrate how model-agnostic methods, such as Leave-One-Covariate-Out (LOCO), can
identify important variables even when classes are not linearly separable.

To evaluate the importance of a variable, LOCO refits a sub-model on the data with the
variable of interest removed. The importance of the variable is quantified as the
increase in loss from the full model to the sub-model. As shown in
:footcite:t:`williamson_2021_nonparametric`, this loss difference can be interpreted as an
unnormalized generalized ANOVA (difference of R²). Denoting by :math:`\mu` the predictive
model, by :math:`\mu_{-j}` the sub-model where the j-th variable is removed, and by
:math:`X^{-j}` the data with the j-th variable removed, the loss difference can be
expressed as:

.. math::
    \psi_{j} = \mathbb{V}(y) \left[ \left[ 1 - \frac{\mathbb{E}[(y - \mu(X))^2]}{\mathbb{V}(y)} \right] - \left[ 1 - \frac{\mathbb{E}[(y - \mu_{-j}(X^{-j}))^2]}{\mathbb{V}(y)} \right] \right]

where :math:`\psi_{j}` is the LOCO importance of the j-th variable.
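
Since the two :math:`\mathbb{V}(y)` factors cancel, :math:`\psi_{j}` reduces to the
mean squared error of the sub-model minus that of the full model. The following is a
minimal sketch of that computation, not the ``hidimstat`` implementation: it assumes a
regression setting with ordinary least squares as :math:`\mu`, synthetic data, and two
hypothetical helpers, ``fit_predict_ols`` and ``loco_importance``.

.. code-block:: Python

    import numpy as np


    def fit_predict_ols(X_train, y_train, X_test):
        """Fit least squares with an intercept and predict on X_test."""
        A_train = np.column_stack([np.ones(len(X_train)), X_train])
        A_test = np.column_stack([np.ones(len(X_test)), X_test])
        coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
        return A_test @ coef


    def loco_importance(X_train, y_train, X_test, y_test, j):
        """Sub-model MSE minus full-model MSE on held-out data (hypothetical helper)."""
        full_pred = fit_predict_ols(X_train, y_train, X_test)
        full_mse = np.mean((y_test - full_pred) ** 2)
        # Refit on the data with the j-th column removed.
        sub_pred = fit_predict_ols(
            np.delete(X_train, j, axis=1), y_train, np.delete(X_test, j, axis=1)
        )
        sub_mse = np.mean((y_test - sub_pred) ** 2)
        return sub_mse - full_mse


    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 3))
    # Only the first feature carries signal; the other two are pure noise.
    y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=400)
    train, test = slice(0, 300), slice(300, 400)

    psi = [loco_importance(X[train], y[train], X[test], y[test], j) for j in range(3)]
    print(np.round(psi, 3))

In this toy setting, the score of the first feature should be clearly positive, while
the scores of the two noise features should stay close to zero.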

.. GENERATED FROM PYTHON SOURCE LINES 27-29

Generate data where classes are not linearly separable
------------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 29-57

.. code-block:: Python


    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns
    from sklearn.datasets import make_circles

    rng = np.random.default_rng(0)
    X, y = make_circles(
        n_samples=500,
        noise=0.1,
        factor=0.6,
        random_state=np.random.RandomState(rng.bit_generator),
    )


    fig, ax = plt.subplots()
    sns.scatterplot(
        x=X[:, 0],
        y=X[:, 1],
        hue=y,
        ax=ax,
        palette="muted",
    )
    ax.legend(title="Class")
    ax.set_xlabel("X1")
    ax.set_ylabel("X2")
    plt.show()


.. image-sg:: /generated/gallery/examples/images/sphx_glr_plot_model_agnostic_importance_001.png
   :alt: plot model agnostic importance
   :srcset: /generated/gallery/examples/images/sphx_glr_plot_model_agnostic_importance_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 58-60

Define a linear and a non-linear estimator
------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 60-71

.. code-block:: Python


    from sklearn.linear_model import LogisticRegressionCV
    from sklearn.svm import SVC

    non_linear_model = SVC(kernel="rbf", random_state=0)
    linear_model = LogisticRegressionCV(
        penalty="l1",
        solver="liblinear",
        max_iter=1000,
    )


.. GENERATED FROM PYTHON SOURCE LINES 72-78

Compute p-values using d0CRT
----------------------------
We first compute the p-values using d0CRT, which performs a conditional independence
test (:math:`H_0: X_j \perp\!\!\!\perp y | X_{-j}`) for each variable. We run it with
both the linear and the non-linear estimator defined above. However, this test relies
on a linear model (LogisticRegression) and fails to reject the null in the presence of
non-linear relationships.
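
To give some intuition for why a linear statistic is blind on this kind of data, the
sketch below builds two concentric circles by hand (an assumption that only mimics
``make_circles``) and compares the correlation of the label with a feature and with
its square. Marginal correlation is used as a simple proxy here; it is not the
statistic computed by d0CRT.

.. code-block:: Python

    import numpy as np

    rng = np.random.default_rng(0)
    n = 2000
    # Class 1 lies on the inner circle (radius 0.6), class 0 on the outer one (radius 1).
    y = rng.integers(0, 2, size=n)
    radius = np.where(y == 1, 0.6, 1.0)
    theta = rng.uniform(0, 2 * np.pi, size=n)
    X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
    X += rng.normal(scale=0.1, size=X.shape)

    # By symmetry, the label is uncorrelated with the raw feature...
    corr_linear = np.corrcoef(X[:, 0], y)[0, 1]
    # ...but it is correlated with the squared feature, which encodes the radius.
    corr_squared = np.corrcoef(X[:, 0] ** 2, y)[0, 1]
    print(f"corr(X1, y)   = {corr_linear:+.2f}")
    print(f"corr(X1^2, y) = {corr_squared:+.2f}")

The first correlation should be close to zero and the second clearly negative, since
the inner class has smaller squared coordinates on average.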

.. GENERATED FROM PYTHON SOURCE LINES 78-98

.. code-block:: Python


    from sklearn.base import clone

    from hidimstat import D0CRT

    d0crt_linear = D0CRT(
        estimator=clone(linear_model), screening_threshold=None, random_state=0
    )
    d0crt_linear.fit_importance(X, y)
    pval_dcrt_linear = d0crt_linear.pvalues_
    print(f"{pval_dcrt_linear=}")

    d0crt_non_linear = D0CRT(
        estimator=clone(non_linear_model), screening_threshold=None, random_state=0
    )
    d0crt_non_linear.fit_importance(X, y)
    pval_dcrt_non_linear = d0crt_non_linear.pvalues_
    print(f"{pval_dcrt_non_linear=}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/.venv/lib/python3.13/site-packages/sklearn/linear_model/_coordinate_descent.py:1622: FutureWarning: 'n_alphas' was deprecated in 1.7 and will be removed in 1.9. 'alphas' now accepts an integer value which removes the need to pass 'n_alphas'. The default value of 'alphas' will change from None to 100 in 1.9. Pass an explicit value to 'alphas' and leave 'n_alphas' to its default value to silence this warning.
      warnings.warn(
    pval_dcrt_linear=array([0.79358069, 0.96922984])
    pval_dcrt_non_linear=array([0.75886965, 0.4576556 ])


.. GENERATED FROM PYTHON SOURCE LINES 99-106

Compute p-values using LOCO
---------------------------
We then compute LOCO importance scores with a linear and then a non-linear model,
using 5-fold cross-validation; the p-values are derived from these scores in the next
step. When using a misspecified model, such as a linear model for this dataset, LOCO
fails to reject the null, similarly to d0CRT. However, when using a non-linear model
(SVC), LOCO is able to identify the important variables.

.. GENERATED FROM PYTHON SOURCE LINES 106-141

.. code-block:: Python


    from sklearn.metrics import hinge_loss, log_loss
    from sklearn.model_selection import KFold

    from hidimstat import LOCO

    cv = KFold(n_splits=5, shuffle=True, random_state=0)

    importances_linear = []
    importances_non_linear = []
    for train, test in cv.split(X):
        # Fit fresh copies of both models on the training fold
        non_linear_model_ = clone(non_linear_model)
        linear_model_ = clone(linear_model)
        non_linear_model_.fit(X[train], y[train])
        linear_model_.fit(X[train], y[train])

        vim_linear = LOCO(
            estimator=linear_model_,
            loss=log_loss,
            method="predict_proba",
            n_jobs=2,
        )
        vim_non_linear = LOCO(
            estimator=non_linear_model_,
            loss=hinge_loss,
            method="decision_function",
            n_jobs=2,
        )
        vim_linear.fit(X[train], y[train])
        vim_non_linear.fit(X[train], y[train])

        importances_linear.append(vim_linear.importance(X[test], y[test]))
        importances_non_linear.append(vim_non_linear.importance(X[test], y[test]))


.. GENERATED FROM PYTHON SOURCE LINES 142-144

To select variables using LOCO, we compute the p-values using a one-sided one-sample
t-test over the per-fold importance scores, testing whether the mean importance is
greater than zero.
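
As a rough illustration of what such a test computes, the sketch below evaluates the
t-statistic by hand for a single variable. The five per-fold scores are hypothetical
values chosen for illustration, not results from this example, and the decision uses
the tabulated critical value of Student's t distribution with 4 degrees of freedom
instead of a p-value.

.. code-block:: Python

    import math
    import statistics

    # Hypothetical per-fold importance scores for one variable (5 folds).
    scores = [0.21, 0.18, 0.25, 0.19, 0.23]

    n = len(scores)
    mean = statistics.mean(scores)
    # statistics.stdev is the sample standard deviation (divides by n - 1).
    std_err = statistics.stdev(scores) / math.sqrt(n)
    t_stat = mean / std_err  # the null hypothesis is a mean importance of 0

    # One-sided critical value at level 0.05 for n - 1 = 4 degrees of freedom.
    T_CRIT_DF4 = 2.132
    reject_null = t_stat > T_CRIT_DF4
    print(f"t = {t_stat:.2f}, reject H0: {reject_null}")
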

.. GENERATED FROM PYTHON SOURCE LINES 144-161

.. code-block:: Python


    from scipy.stats import ttest_1samp

    _, pval_linear = ttest_1samp(
        importances_linear,
        0,
        axis=0,
        alternative="greater",
    )
    _, pval_non_linear = ttest_1samp(
        importances_non_linear, 0, axis=0, alternative="greater"
    )

    print(f"{pval_linear=}")
    print(f"{pval_non_linear=}")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    pval_linear=array([0.79043471, 0.39557695])
    pval_non_linear=array([9.45765879e-05, 6.55666251e-04])


.. GENERATED FROM PYTHON SOURCE LINES 162-164

Collect the p-values for each method and variable
--------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 164-187

.. code-block:: Python


    import pandas as pd

    df_pval = pd.DataFrame(
        {
            "pval": np.hstack(
                [
                    pval_dcrt_linear,
                    pval_dcrt_non_linear,
                    pval_linear,
                    pval_non_linear,
                ]
            ),
            "method": ["d0CRT-linear"] * 2
            + ["d0CRT-non-linear"] * 2
            + ["LOCO-linear"] * 2
            + ["LOCO-non-linear"] * 2,
            "Feature": ["X1", "X2"] * 4,
        }
    )
    df_pval["minus_log10_pval"] = -np.log10(df_pval["pval"])


.. GENERATED FROM PYTHON SOURCE LINES 188-190

Plot the :math:`-\log_{10}(pval)` for each method and variable
--------------------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 190-213

.. code-block:: Python


    fig, ax = plt.subplots()
    sns.barplot(
        data=df_pval,
        y="Feature",
        x="minus_log10_pval",
        hue="method",
        palette="muted",
        ax=ax,
    )
    ax.set_xlabel("-$\\log_{10}(pval)$")
    ax.axvline(
        -np.log10(0.05),
        color="k",
        lw=3,
        linestyle="--",
        label="-$\\log_{10}(0.05)$",
    )
    ax.legend()
    plt.show()


.. image-sg:: /generated/gallery/examples/images/sphx_glr_plot_model_agnostic_importance_002.png
   :alt: plot model agnostic importance
   :srcset: /generated/gallery/examples/images/sphx_glr_plot_model_agnostic_importance_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 214-219

As expected, the misspecified linear models (d0CRT-linear and LOCO-linear) do not select
any variable. This highlights the benefit of model-agnostic methods such as LOCO, which
can be paired with a model expressive enough to explain the data. While d0CRT can use
any estimator, its distillation step prevents it from capturing variable interactions,
so it also fails to select the variables with the non-linear estimator.
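
The selection rule read off the plot can be sketched as follows: a variable is selected
when its :math:`-\log_{10}` p-value lies to the right of the dashed line. The p-values
are copied from the outputs printed earlier in this example; the dictionary layout and
variable names are illustrative choices.

.. code-block:: Python

    import math

    # p-values copied from the outputs printed earlier in this example.
    pvalues = {
        "d0CRT-linear": [0.79358069, 0.96922984],
        "d0CRT-non-linear": [0.75886965, 0.4576556],
        "LOCO-linear": [0.79043471, 0.39557695],
        "LOCO-non-linear": [9.45765879e-05, 6.55666251e-04],
    }
    threshold = -math.log10(0.05)  # position of the dashed line, about 1.30

    selected = {
        method: [f"X{j + 1}" for j, p in enumerate(pvals) if -math.log10(p) > threshold]
        for method, pvals in pvalues.items()
    }
    for method, features in selected.items():
        print(method, features)

Only LOCO with the non-linear model selects both ``X1`` and ``X2``; the three other
configurations select nothing.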

.. GENERATED FROM PYTHON SOURCE LINES 222-225

References
----------
.. footbibliography::


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 5.079 seconds)

**Estimated memory usage:**  215 MB


.. _sphx_glr_download_generated_gallery_examples_plot_model_agnostic_importance.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_model_agnostic_importance.ipynb <plot_model_agnostic_importance.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_model_agnostic_importance.py <plot_model_agnostic_importance.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_model_agnostic_importance.zip <plot_model_agnostic_importance.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_