.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "generated/gallery/examples/plot_knockoffs_wisconsin.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_generated_gallery_examples_plot_knockoffs_wisconsin.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_generated_gallery_examples_plot_knockoffs_wisconsin.py:

Controlled multiple variable selection on the Wisconsin breast cancer dataset
=============================================================================

In this example, we explore the basics of variable selection and illustrate the need
to statistically control the number of falsely selected variables. We compare two
variable selection methods: the Lasso and Model-X Knockoffs
:footcite:t:`candes2018panning`. We show that the Lasso is not robust to the presence
of irrelevant variables, while the Knockoffs (KO) method addresses this issue.

.. GENERATED FROM PYTHON SOURCE LINES 13-19

Load the breast cancer dataset
------------------------------

There are 569 samples and 30 features that correspond to tumor attributes. The
downstream task is to classify tumors as benign or malignant. We leave out 10% of
the data to evaluate the performance of the Logistic Lasso (logistic regression
with L1 regularization) on the prediction task.

.. GENERATED FROM PYTHON SOURCE LINES 19-37

.. code-block:: Python

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    data = load_breast_cancer()
    X = data.data
    y = data.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    n_train, p = X_train.shape
    n_test = X_test.shape[0]
    feature_names = [str(name) for name in data.feature_names]

.. GENERATED FROM PYTHON SOURCE LINES 38-44

Selecting variables with the Logistic Lasso
-------------------------------------------

We want to select variables that are relevant to the outcome, i.e. tumor
characteristics that are associated with tumor malignancy. We start by applying a
classical approach: fitting a Lasso logistic regression and retaining the variables
with non-null coefficients:

.. GENERATED FROM PYTHON SOURCE LINES 44-63

.. code-block:: Python

    import numpy as np
    from sklearn.linear_model import LogisticRegressionCV

    clf = LogisticRegressionCV(
        Cs=np.logspace(-3, 3, 10), penalty="l1", solver="liblinear", random_state=0
    )
    clf.fit(X_train, y_train)
    print(f"Accuracy of Lasso on test set: {clf.score(X_test, y_test):.3f}")

    selected_lasso = np.where(np.abs(clf.coef_[0]) > 1e-6)[0]
    print(f"The Lasso selects {len(selected_lasso)} variables:")
    print(f"{'Variable name':<30} | {'Coefficient':>10}")
    print("-" * 45)
    for i in selected_lasso:
        print(f"{feature_names[i]:<30} | {clf.coef_[0][i]:>10.3f}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Accuracy of Lasso on test set: 1.000
    The Lasso selects 8 variables:
    Variable name                  | Coefficient
    ---------------------------------------------
    mean concave points            |     -0.439
    radius error                   |     -0.570
    worst radius                   |     -2.274
    worst texture                  |     -0.603
    worst smoothness               |     -0.223
    worst concavity                |     -0.061
    worst concave points           |     -0.937
    worst symmetry                 |     -0.148
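Before trusting this selection, note that "non-null coefficient" is a fragile rule:
the selected set depends on the particular training sample and on the regularization
level chosen by cross-validation. As a quick, purely illustrative check (not part of
the original example, but using only names defined above), we can refit the same
model on bootstrap resamples of the training set and count how often each of the
variables selected above survives:

.. code-block:: Python

    from sklearn.utils import resample

    # Refit the Lasso on bootstrap resamples and count, for each variable,
    # the number of resamples in which it receives a non-null coefficient.
    selection_counts = np.zeros(p)
    for seed in range(10):
        X_boot, y_boot = resample(X_train, y_train, random_state=seed)
        clf_boot = LogisticRegressionCV(
            Cs=np.logspace(-3, 3, 10), penalty="l1", solver="liblinear", random_state=0
        ).fit(X_boot, y_boot)
        selection_counts += np.abs(clf_boot.coef_[0]) > 1e-6

    for i in selected_lasso:
        print(f"{feature_names[i]:<30} selected in {int(selection_counts[i])}/10 resamples")

Variables that appear in only a few resamples should be treated with caution; this
instability is one symptom of the lack of statistical control that the rest of this
example addresses.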
.. GENERATED FROM PYTHON SOURCE LINES 64-72

Evaluating the rejection set
----------------------------

Since we do not have the ground truth for the selected variables (i.e. we do not
know the true relationship between the tumor characteristics and tumor malignancy),
we cannot evaluate this selection set directly. To investigate the reliability of
this method, we artificially increase the number of variables by adding noisy copies
of the features. These are correlated with the original variables, but carry no
information about the outcome beyond what those variables already provide.

.. GENERATED FROM PYTHON SOURCE LINES 72-94

.. code-block:: Python

    # Define the seed for reproducibility of the example
    rng = np.random.default_rng(0)

    repeats_noise = 5  # Number of synthetic noisy sets to add

    noises_train = [X_train]
    noises_test = [X_test]
    feature_names_noise = [x for x in feature_names]
    for k in range(repeats_noise):
        X_train_c = X_train.copy()
        X_test_c = X_test.copy()
        noises_train.append(X_train_c + 2 * rng.standard_normal((n_train, p)))
        noises_test.append(X_test_c + 2 * rng.standard_normal((n_test, p)))
        feature_names_noise += [f"spurious #{k*p+i}" for i in range(p)]

    noisy_train = np.concatenate(noises_train, axis=1)
    noisy_test = np.concatenate(noises_test, axis=1)

.. GENERATED FROM PYTHON SOURCE LINES 95-98

There are now 180 features: 30 of them are real and 150 of them are spurious and
add no information about the outcome. We now apply the Lasso (with cross-validation
to select the best regularization parameter) to the noisy dataset and observe the
results:

.. GENERATED FROM PYTHON SOURCE LINES 98-134

.. code-block:: Python

    import pandas as pd

    lasso_noisy = LogisticRegressionCV(
        Cs=np.logspace(-3, 3, 10),
        penalty="l1",
        solver="liblinear",
        random_state=0,
        n_jobs=4,
    )
    lasso_noisy.fit(noisy_train, y_train)
    y_pred_noisy = lasso_noisy.predict(noisy_test)
    print(
        (
            "Accuracy of Lasso on test set with noise: "
            f"{lasso_noisy.score(noisy_test, y_test):.3f}"
        )
    )

    selected_mask = [
        "selected" if np.abs(x) > 1e-6 else "not selected" for x in lasso_noisy.coef_[0]
    ]
    df_lasso_noisy = pd.DataFrame(
        {
            "score": np.abs(lasso_noisy.coef_[0]),
            "variable": feature_names_noise,
            "selected": selected_mask,
        }
    )
    # Count how many selected features are actually noise
    num_false_discoveries = np.sum(np.array(selected_mask[p:]) == "selected")
    print(f"The Lasso makes at least {num_false_discoveries} False Discoveries!!")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Accuracy of Lasso on test set with noise: 0.965
    The Lasso makes at least 42 False Discoveries!!

.. GENERATED FROM PYTHON SOURCE LINES 135-139

The Lasso selects many spurious variables that are not directly related to the
outcome. To mitigate this problem, we can use one of the statistically controlled
variable selection methods implemented in hidimstat. These methods guarantee that
the expected proportion of false discoveries stays below a bound set by the user.

.. GENERATED FROM PYTHON SOURCE LINES 142-147

Controlled variable selection with Knockoffs
--------------------------------------------

We use the Model-X Knockoff procedure to control the FDR (False Discovery Rate).
The selection of variables is based on the Lasso Coefficient Difference (LCD)
statistic :footcite:t:`candes2018panning`.
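To build intuition before running the full procedure, here is a minimal, purely
illustrative sketch of the LCD statistic itself. A real knockoff copy must preserve
the covariance structure of the original variables (which is what the sampler below
does); here we fake one by permuting rows, which is *not* a valid knockoff
construction, only a way to show how the statistic is computed:

.. code-block:: Python

    # Illustrative only: permuted rows stand in for a proper knockoff copy.
    rng_demo = np.random.default_rng(0)
    X_tilde_demo = noisy_train[rng_demo.permutation(n_train)]
    augmented = np.concatenate([noisy_train, X_tilde_demo], axis=1)

    # Fit the Lasso on the augmented matrix [X, X_tilde] ...
    lasso_aug = LogisticRegressionCV(
        Cs=np.logspace(-3, 3, 10), penalty="l1", solver="liblinear", random_state=0
    ).fit(augmented, y_train)

    # ... and contrast each coefficient with that of its knockoff:
    # W_j = |beta_j| - |beta_{j+m}|. A large positive W_j means the real
    # variable carries signal that its knockoff cannot imitate.
    beta = lasso_aug.coef_[0]
    m = noisy_train.shape[1]
    W = np.abs(beta[:m]) - np.abs(beta[m:])

The procedure below performs the same kind of contrast, but with a valid Gaussian
knockoff sampler and a data-driven threshold on the statistics that guarantees FDR
control.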
.. GENERATED FROM PYTHON SOURCE LINES 147-178

.. code-block:: Python

    from sklearn.covariance import LedoitWolf

    from hidimstat import ModelXKnockoff
    from hidimstat.samplers import GaussianKnockoffs

    model_x_knockoff = ModelXKnockoff(
        ko_generator=GaussianKnockoffs(
            cov_estimator=LedoitWolf(assume_centered=True), tol=1e-15
        ),
        estimator=LogisticRegressionCV(
            solver="liblinear",
            penalty="l1",
            Cs=np.logspace(-3, 3, 10),
            random_state=0,
            tol=1e-3,
            max_iter=1000,
        ),
        random_state=0,
        preconfigure_lasso_path=False,
    )
    importance = model_x_knockoff.fit_importance(
        noisy_train,
        y_train,
    )
    selected = model_x_knockoff.fdr_selection(fdr=0.2)
    # Count how many selected features are actually noise
    num_false_discoveries = np.sum(selected[p:])
    print(f"Knockoffs make at least {num_false_discoveries} False Discoveries")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    /home/circleci/project/src/hidimstat/samplers/gaussian_knockoffs.py:205: UserWarning: The equi-correlated matrix for knockoffs is not positive definite. Reduce the value of distance by 2.220446049250313e-16.
      warnings.warn(
    /home/circleci/project/src/hidimstat/samplers/gaussian_knockoffs.py:205: UserWarning: The equi-correlated matrix for knockoffs is not positive definite. Reduce the value of distance by 2.220446049250313e-15.
      warnings.warn(
    /home/circleci/project/src/hidimstat/samplers/gaussian_knockoffs.py:205: UserWarning: The equi-correlated matrix for knockoffs is not positive definite. Reduce the value of distance by 2.220446049250313e-14.
      warnings.warn(
    Knockoffs make at least 2 False Discoveries

.. GENERATED FROM PYTHON SOURCE LINES 179-186

Visualizing the results
-----------------------

We can compare the selection sets obtained by the two methods. In addition to the
binary selection (selected or not), we can also visualize the KO statistic along
with the selection threshold for the knockoffs, and the absolute value of the Lasso
coefficients. We plot the 25 most important features according to the KO statistic.

.. GENERATED FROM PYTHON SOURCE LINES 186-243

.. code-block:: Python

    import matplotlib.pyplot as plt
    import seaborn as sns

    selected_mask = np.array(["not selected"] * len(importance[0]))
    selected_mask[selected] = "selected"
    df_ko = pd.DataFrame(
        {
            "score": importance[0],
            "variable": feature_names_noise,
            "selected": selected_mask,
        }
    )
    df_ko = df_ko.sort_values(by="score", ascending=False).head(25)

    fig, axes = plt.subplots(
        1,
        2,
        sharey=True,
    )
    ax = axes[0]
    sns.scatterplot(
        data=df_ko,
        x="score",
        y="variable",
        hue="selected",
        ax=ax,
        palette={"selected": "tab:red", "not selected": "tab:gray"},
    )
    ax.axvline(
        x=model_x_knockoff.threshold_fdr_, color="k", linestyle="--", label="Threshold"
    )
    ax.legend()
    ax.set_xlabel("KO statistic (LCD)")
    ax.set_ylabel("")
    ax.set_title("Knockoffs", fontweight="bold")

    ax = axes[1]
    sns.scatterplot(
        data=df_lasso_noisy[df_lasso_noisy["variable"].isin(df_ko["variable"])],
        x="score",
        y="variable",
        hue="selected",
        ax=ax,
        palette={"selected": "tab:red", "not selected": "tab:gray"},
        legend=False,
    )
    ax.set_xlabel("$|\\hat{\\beta}|$")
    ax.axvline(
        x=0,
        color="k",
        linestyle="--",
    )
    ax.set_title("Lasso", fontweight="bold")
    plt.tight_layout()
    plt.show()

.. image-sg:: /generated/gallery/examples/images/sphx_glr_plot_knockoffs_wisconsin_001.png
   :alt: Knockoffs, Lasso
   :srcset: /generated/gallery/examples/images/sphx_glr_plot_knockoffs_wisconsin_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 244-247

The plot shows that the knockoffs procedure is more conservative than the Lasso,
selecting far fewer spurious features. At the same time, it successfully identifies
the truly relevant features.
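Because we constructed the spurious features ourselves (they occupy positions ``p``
onward), we can also quantify the comparison with the empirical false discovery
proportion (FDP) of each selection set. This check is not part of the original
example, but every name it uses is defined above:

.. code-block:: Python

    # Empirical FDP: fraction of selected features that are known to be spurious.
    def empirical_fdp(selection_mask):
        n_selected = selection_mask.sum()
        n_false = selection_mask[p:].sum()
        return n_false / max(n_selected, 1)

    lasso_selection = np.abs(lasso_noisy.coef_[0]) > 1e-6
    print(f"Lasso empirical FDP: {empirical_fdp(lasso_selection):.2f}")
    print(f"Knockoffs empirical FDP: {empirical_fdp(np.asarray(selected)):.2f}")

The knockoff FDP should land below the target ``fdr=0.2`` on average over repeated
runs; the Lasso offers no such guarantee.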
.. GENERATED FROM PYTHON SOURCE LINES 250-253

References
----------

.. footbibliography::

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 5.991 seconds)

**Estimated memory usage:** 215 MB

.. _sphx_glr_download_generated_gallery_examples_plot_knockoffs_wisconsin.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_knockoffs_wisconsin.ipynb <plot_knockoffs_wisconsin.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_knockoffs_wisconsin.py <plot_knockoffs_wisconsin.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_knockoffs_wisconsin.zip <plot_knockoffs_wisconsin.zip>`

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_