.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "generated/gallery/examples/plot_knockoff_aggregation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_generated_gallery_examples_plot_knockoff_aggregation.py:


Knockoff aggregation
====================

This example shows how to aggregate model-X Knockoff selections to derandomize
inference. The model-X Knockoff introduced by :footcite:t:`candes2018panning`
allows for variable selection with statistical guarantees on the False Discovery
Rate (FDR),

.. math::
    FDR = \mathbb{E}[ FDP ] = \mathbb{E} \left[ \frac{|\hat{S} \cap \mathcal{H}_0 | }{| \hat{S} |} \right]

where :math:`\hat{S}` is the set of selected variables and :math:`\mathcal{H}_0`
is the set of null variables (i.e., variables with no effect on the response).

A notable drawback of this procedure is the randomness associated with generating
the knockoff variables, :math:`\tilde{X}`. This can result in fluctuations of the
statistical power and of the false discovery proportion (FDP) from one run to the
next, and consequently in unstable inference.

To mitigate this issue, several aggregation procedures have been proposed in the
literature. :footcite:t:`pmlr-v119-nguyen20a` introduces a quantile aggregation
procedure based on the p-values obtained from multiple independent runs of the
knockoff filter. Alternatively, :footcite:t:`Ren_2023` proposes an aggregation
procedure based on e-values. We illustrate both procedures in this example.

.. GENERATED FROM PYTHON SOURCE LINES 26-32

Generating data
---------------
We use a simulated dataset for which the ground truth is known, so that we can
evaluate the different aggregation procedures in terms of statistical power and
false discovery proportion. We generate data with `n=300` samples and `p=100`
correlated features.

..
.. GENERATED FROM PYTHON SOURCE LINES 32-57

.. code-block:: Python

    import numpy as np

    from hidimstat._utils.scenario import multivariate_simulation

    n_features = 100
    n_samples = 300
    # Correlation
    rho = 0.5
    # Sparsity of the support
    sparsity = 0.5
    # Signal-to-noise ratio
    snr = 10

    # Generate data
    X, y, beta_true, noise = multivariate_simulation(
        n_samples=n_samples,
        n_features=n_features,
        rho=rho,
        support_size=int(n_features * sparsity),
        signal_noise_ratio=snr,
        seed=0,
    )

.. GENERATED FROM PYTHON SOURCE LINES 58-65

Inference with model-X Knockoffs
--------------------------------
We repeat the model-X Knockoff procedure multiple times, as controlled by the
`n_repeats` parameter, to obtain different selections. This allows us to observe
the variability of the selections induced by the knockoff lottery. We then compare
two ways of aggregating the selections in order to derandomize the inference.

.. GENERATED FROM PYTHON SOURCE LINES 65-87

.. code-block:: Python

    from hidimstat.knockoffs import ModelXKnockoff
    from hidimstat.statistical_tools.multiple_testing import fdp_power

    fdr = 0.1
    n_repeats = 25
    n_jobs = 4

    model_x_knockoff = ModelXKnockoff(n_repeats=n_repeats, n_jobs=n_jobs, random_state=0)
    model_x_knockoff.fit_importance(X, y)

    fdp_individual = []
    power_individual = []

    # Iterate over the knockoff statistics of each repeat and evaluate the
    # selection obtained from that single run.
    for ko_statistics in model_x_knockoff.importances_:
        threshold = model_x_knockoff.knockoff_threshold(ko_statistics, fdr=fdr)
        ko_selection = ko_statistics > threshold
        fdp, power = fdp_power(ko_selection, ground_truth=beta_true)
        fdp_individual.append(fdp)
        power_individual.append(power)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/src/hidimstat/samplers/gaussian_knockoffs.py:205: UserWarning: The equi-correlated matrix for knockoffs is not positive definite. Reduce the value of distance by 2.220446049250313e-16.
      warnings.warn(
    /home/circleci/project/src/hidimstat/samplers/gaussian_knockoffs.py:205: UserWarning: The equi-correlated matrix for knockoffs is not positive definite. Reduce the value of distance by 2.220446049250313e-15.
      warnings.warn(
    /home/circleci/project/src/hidimstat/samplers/gaussian_knockoffs.py:205: UserWarning: The equi-correlated matrix for knockoffs is not positive definite. Reduce the value of distance by 2.220446049250313e-14.
      warnings.warn(

.. GENERATED FROM PYTHON SOURCE LINES 88-93

Visualize the results of the individual selections
--------------------------------------------------
We first visualize the results of the individual selections to observe the
variability induced by the knockoff lottery. We plot the False Discovery
Proportion (FDP) for each run along with the desired FDR level (red dashed line)
and the statistical power.

.. GENERATED FROM PYTHON SOURCE LINES 93-135

.. code-block:: Python

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    df_plot = pd.DataFrame(
        {
            "FDP": fdp_individual,
            "Power": power_individual,
        }
    )

    _, axes = plt.subplots(1, 2, figsize=(5, 3.5))

    # Plot the FDP of each run, the target FDR and the empirical FDR (mean FDP)
    ax = axes[0]
    sns.swarmplot(
        data=df_plot,
        y="FDP",
        ax=ax,
    )
    ax.axhline(fdr, color="tab:red", linestyle="--", lw=2, label="Desired FDR")
    ax.scatter(
        0,
        np.mean(fdp_individual),
        marker="d",
        color="tab:orange",
        s=100,
        zorder=10,
        label="Empirical FDR",
    )
    ax.legend(framealpha=0.2)

    # Plot the power
    ax = axes[1]
    sns.swarmplot(
        data=df_plot,
        y="Power",
        ax=ax,
    )
    sns.despine()
    _ = plt.tight_layout()

.. image-sg:: /generated/gallery/examples/images/sphx_glr_plot_knockoff_aggregation_001.png
   :alt: plot knockoff aggregation
   :srcset: /generated/gallery/examples/images/sphx_glr_plot_knockoff_aggregation_001.png
   :class: sphx-glr-single-img

..
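The FDP and power reported for each run follow directly from the FDR definition
given at the top of this example. As a minimal sketch, assuming a boolean selection
mask and a coefficient vector whose non-zero entries mark the true support, they can
be computed with NumPy alone. The helper name ``fdp_and_power`` is hypothetical and
only illustrative; it is not hidimstat's ``fdp_power``, whose implementation may
differ.

.. code-block:: Python

    import numpy as np


    def fdp_and_power(selected, beta_true):
        """Return (FDP, power) for a boolean selection (illustrative helper)."""
        selected = np.asarray(selected, dtype=bool)
        # Assumption: a variable is non-null when its true coefficient is non-zero
        support = np.asarray(beta_true) != 0

        n_false_discoveries = np.sum(selected & ~support)
        n_true_discoveries = np.sum(selected & support)

        # Guard against empty selections / empty supports (ratio set to 0)
        fdp = n_false_discoveries / max(selected.sum(), 1)
        power = n_true_discoveries / max(support.sum(), 1)
        return fdp, power


    # Toy example: 3 selected variables, one of which is a null
    toy_selection = np.array([True, True, False, True, False])
    toy_beta = np.array([1.0, 0.0, 2.0, 3.0, 0.0])
    toy_fdp, toy_power = fdp_and_power(toy_selection, toy_beta)

In this toy case one of the three selected variables is null and two of the three
true variables are recovered.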
.. GENERATED FROM PYTHON SOURCE LINES 136-142

Aggregation procedures
----------------------
We now aggregate the selections using both the p-value aggregation procedure of
:footcite:t:`pmlr-v119-nguyen20a` and the e-value aggregation procedure of
:footcite:t:`Ren_2023`. We then compare the results of both procedures in terms
of FDP and power.

.. GENERATED FROM PYTHON SOURCE LINES 142-209

.. code-block:: Python

    # Aggregate the repeats with quantile aggregation of p-values
    pval_aggregation = model_x_knockoff.fdr_selection(fdr=fdr, adaptive_aggregation=True)
    fdp_pval_agg, power_pval_agg = fdp_power(pval_aggregation, ground_truth=beta_true)

    # Aggregate the repeats with e-values
    eval_aggregation = model_x_knockoff.fdr_selection(
        fdr=fdr, fdr_control="ebh", evalues=True
    )
    fdp_eval_agg, power_eval_agg = fdp_power(eval_aggregation, ground_truth=beta_true)

    df_plot["Method"] = "Individual KO"
    df_plot_2 = pd.concat(
        [
            df_plot,
            pd.DataFrame(
                {
                    "FDP": [fdp_pval_agg],
                    "Power": [power_pval_agg],
                    "Method": ["P-value aggregation"],
                }
            ),
            pd.DataFrame(
                {
                    "FDP": [fdp_eval_agg],
                    "Power": [power_eval_agg],
                    "Method": ["E-value aggregation"],
                }
            ),
        ],
        ignore_index=True,
    )

    # Plot the results
    # ----------------
    # In addition to the individual selections (blue), we plot the FDP and power
    # obtained by p-value aggregation (orange) and e-value aggregation (green).

    # sphinx_gallery_thumbnail_number = 2
    _, axes = plt.subplots(1, 2, figsize=(6, 3.5))
    ax = axes[0]
    sns.stripplot(
        data=df_plot_2,
        y="FDP",
        hue="Method",
        ax=ax,
        palette="muted",
        dodge=True,
        legend=False,
        size=8,
        linewidth=1,
    )
    ax.axhline(fdr, color="tab:red", linestyle="--", lw=2, label="Desired FDR")

    ax = axes[1]
    sns.stripplot(
        data=df_plot_2,
        y="Power",
        hue="Method",
        ax=ax,
        palette="muted",
        dodge=True,
        size=8,
        linewidth=1,
    )
    sns.despine()
    _ = plt.tight_layout()

.. image-sg:: /generated/gallery/examples/images/sphx_glr_plot_knockoff_aggregation_002.png
   :alt: plot knockoff aggregation
   :srcset: /generated/gallery/examples/images/sphx_glr_plot_knockoff_aggregation_002.png
   :class: sphx-glr-single-img

..
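To make the e-value aggregation more concrete, here is a rough NumPy sketch of the
idea described by :footcite:t:`Ren_2023`: each run is turned into one e-value per
feature using its own knockoff threshold, the e-values are averaged across runs, and
the e-BH procedure is applied to the averages. The function names
``knockoff_plus_threshold`` and ``evalue_aggregation`` and the ``fdr_knockoff``
parameter are hypothetical, and the default ``fdr_knockoff = fdr / 2`` is an
assumption made for illustration. This is not the code run by ``fdr_selection`` in
hidimstat, which may differ in its details.

.. code-block:: Python

    import numpy as np


    def knockoff_plus_threshold(w, fdr):
        """Smallest t such that (1 + #{w_j <= -t}) / max(#{w_j >= t}, 1) <= fdr."""
        for t in np.sort(np.abs(w[w != 0])):
            ratio = (1 + np.sum(w <= -t)) / max(np.sum(w >= t), 1)
            if ratio <= fdr:
                return t
        return np.inf  # no threshold satisfies the constraint: select nothing


    def evalue_aggregation(w_runs, fdr, fdr_knockoff=None):
        """Aggregate knockoff statistics of shape (n_runs, n_features) with e-BH."""
        w_runs = np.asarray(w_runs, dtype=float)
        n_runs, n_features = w_runs.shape
        if fdr_knockoff is None:
            fdr_knockoff = fdr / 2  # assumption: per-run level, illustrative default

        # One e-value per run and per feature
        evalues = np.zeros((n_runs, n_features))
        for m, w in enumerate(w_runs):
            t = knockoff_plus_threshold(w, fdr_knockoff)
            if np.isfinite(t):
                evalues[m] = n_features * (w >= t) / (1 + np.sum(w <= -t))

        # Averaging e-values over runs yields valid e-values again
        e_mean = evalues.mean(axis=0)

        # e-BH: keep the k largest e-values, with k the largest index such that
        # the k-th largest e-value is at least n_features / (fdr * k)
        order = np.argsort(-e_mean)
        ranks = np.arange(1, n_features + 1)
        passing = np.nonzero(e_mean[order] >= n_features / (fdr * ranks))[0]
        selected = np.zeros(n_features, dtype=bool)
        if passing.size > 0:
            selected[order[: passing[-1] + 1]] = True
        return selected


    # Toy statistics: 5 runs, 20 features, the first 10 carry a strong signal
    rng = np.random.default_rng(0)
    w_toy = rng.uniform(-0.5, 0.5, size=(5, 20))
    w_toy[:, :10] += 5.0
    toy_selected = evalue_aggregation(w_toy, fdr=0.2)

Because the aggregated selection depends on the runs only through the averaged
e-values, adding more runs tends to stabilize the outcome, which is the
derandomization effect this example aims to illustrate.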
.. GENERATED FROM PYTHON SOURCE LINES 210-212

On this simulated dataset, both aggregation procedures appear to lower the false
discovery proportion while maintaining good statistical power.

.. GENERATED FROM PYTHON SOURCE LINES 215-218

References
----------
.. footbibliography::


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 2.840 seconds)

**Estimated memory usage:** 215 MB


.. _sphx_glr_download_generated_gallery_examples_plot_knockoff_aggregation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_knockoff_aggregation.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_knockoff_aggregation.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_knockoff_aggregation.zip `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_