.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_knockoff_aggregation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_plot_knockoff_aggregation.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_knockoff_aggregation.py:


Knockoff aggregation on simulated data
======================================

In this example, we perform variable selection using model-X Knockoffs,
introduced by :footcite:t:`candes2018panning`. A notable drawback of this
procedure is the randomness associated with generating the knockoff variables:
two runs on the same data can select different variables. This results in
fluctuations of the statistical power and of the false discovery proportion,
and consequently in unstable inference. This example illustrates the two
aggregation procedures described by :footcite:t:`pmlr-v119-nguyen20a` and
:footcite:t:`Ren_2023`, which derandomize the inference.

.. GENERATED FROM PYTHON SOURCE LINES 17-19

Imports needed for this script
------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 19-37

.. code-block:: Python

    from joblib import Parallel, delayed
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.linear_model import LassoCV
    from sklearn.model_selection import KFold
    from sklearn.utils import check_random_state

    from hidimstat.knockoffs import (
        model_x_knockoff,
        model_x_knockoff_bootstrap_e_value,
        model_x_knockoff_bootstrap_quantile,
        model_x_knockoff_pvalue,
    )
    from hidimstat.statistical_tools.multiple_testing import fdp_power
    from hidimstat._utils.scenario import multivariate_1D_simulation_AR

.. GENERATED FROM PYTHON SOURCE LINES 38-44

Data simulation
---------------

The comparison of the three methods relies on evaluating the False Discovery
Proportion (FDP) and the statistical power. Assessing these metrics requires
knowledge of the actual data-generating process. We therefore use simulated
data with the following parameters:

.. GENERATED FROM PYTHON SOURCE LINES 44-73

.. code-block:: Python

    # number of repetitions of the methods
    runs = 20
    # number of observations
    n_samples = 200
    # number of variables
    n_features = 150
    # correlation parameter
    rho = 0.4
    # ratio of the number of variables with non-zero coefficients over the
    # total number of coefficients
    sparsity = 0.2
    # desired controlled False Discovery Rate (FDR) level
    fdr = 0.1
    # signal-to-noise ratio
    snr = 10
    # number of repetitions for the bootstraps
    n_bootstraps = 25
    # seed for the random generator
    seed = 45
    # number of jobs for the repetitions of the method
    n_jobs = 2
    # verbosity of joblib
    joblib_verbose = 0

    rng = check_random_state(seed)
    seed_list = rng.randint(1, np.iinfo(np.int32).max, runs)
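Throughout the example, each selection is scored with ``fdp_power``. As a
reminder of what the two metrics measure, the sketch below re-implements
their textbook definitions. It is a minimal illustration assuming ``selected``
holds the indices of the selected variables; ``fdp_power_sketch`` is a
hypothetical helper, not the actual hidimstat implementation, which may
handle edge cases differently.

.. code-block:: Python

    def fdp_power_sketch(selected, non_zero_index):
        # hypothetical helper, for illustration only
        selected = np.asarray(selected)
        # false discoveries: selected variables outside the true support
        n_false = np.setdiff1d(selected, non_zero_index).size
        # FDP: fraction of the selections that are false
        # (defined as 0 when nothing is selected)
        fdp = n_false / max(selected.size, 1)
        # power: fraction of the true support that is recovered
        power = np.intersect1d(selected, non_zero_index).size / len(non_zero_index)
        return fdp, power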
.. GENERATED FROM PYTHON SOURCE LINES 74-76

Define the function for running the three procedures on the same data
----------------------------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 76-126

.. code-block:: Python

    def single_run(
        n_samples, n_features, rho, sparsity, snr, fdr, n_bootstraps, seed=None
    ):
        # Generate data
        X, y, _, non_zero_index = multivariate_1D_simulation_AR(
            n_samples, n_features, rho=rho, sparsity=sparsity, seed=seed, snr=snr
        )

        # Use model-X Knockoffs [1]
        selected, test_scores, threshold, X_tildes = model_x_knockoff(
            X,
            y,
            estimator=LassoCV(
                n_jobs=1,
                cv=KFold(n_splits=5, shuffle=True, random_state=0),
            ),
            n_bootstraps=1,
            random_state=seed,
        )
        mx_selection, _ = model_x_knockoff_pvalue(test_scores, fdr=fdr)
        fdp_mx, power_mx = fdp_power(mx_selection, non_zero_index)

        # Use aggregation of model-X Knockoffs [2]
        selected, test_scores, threshold, X_tildes = model_x_knockoff(
            X,
            y,
            estimator=LassoCV(
                n_jobs=1,
                cv=KFold(n_splits=5, shuffle=True, random_state=0),
            ),
            n_bootstraps=n_bootstraps,
            n_jobs=1,
            random_state=seed,
        )

        # Use p-values aggregation [2]
        aggregated_ko_selection, _, _ = model_x_knockoff_bootstrap_quantile(
            test_scores, fdr=fdr, adaptive_aggregation=True
        )
        fdp_pval, power_pval = fdp_power(aggregated_ko_selection, non_zero_index)

        # Use e-values aggregation [3]
        eval_selection, _, _ = model_x_knockoff_bootstrap_e_value(
            test_scores, threshold, fdr=fdr
        )
        fdp_eval, power_eval = fdp_power(eval_selection, non_zero_index)

        return fdp_mx, fdp_pval, fdp_eval, power_mx, power_pval, power_eval

.. GENERATED FROM PYTHON SOURCE LINES 127-129

Define the function for plotting the results
--------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 129-161

.. code-block:: Python

    def plot_results(bounds, fdr, n_samples, n_features, power=False):
        plt.figure(figsize=(5, 5), layout="constrained")
        for nb in range(len(bounds)):
            for i in range(len(bounds[nb])):
                y = bounds[nb][i]
                x = np.random.normal(nb + 1, 0.05)
                plt.scatter(x, y, alpha=0.65, c="blue")

        plt.boxplot(bounds, sym="")
        if power:
            plt.xticks(
                [1, 2, 3],
                ["MX Knockoffs", "Quantile aggregation", "e-values aggregation"],
                rotation=45,
                ha="right",
            )
            plt.title(f"FDR = {fdr}, n = {n_samples}, p = {n_features}")
            plt.ylabel("Empirical Power")
        else:
            plt.hlines(
                fdr, xmin=0.5, xmax=3.5, label="Requested FDR control", color="red"
            )
            plt.xticks(
                [1, 2, 3],
                ["MX Knockoffs", "Quantile aggregation", "e-values aggregation"],
                rotation=45,
                ha="right",
            )
            plt.title(f"FDR = {fdr}, n = {n_samples}, p = {n_features}")
            plt.ylabel("Empirical FDP")
            plt.legend(loc="best")

.. GENERATED FROM PYTHON SOURCE LINES 162-164

Define the function to evaluate the effect of the number of samples
-------------------------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 164-198

.. code-block:: Python

    def effect_number_samples(n_samples):
        parallel = Parallel(n_jobs, verbose=joblib_verbose)
        results = parallel(
            delayed(single_run)(
                n_samples, n_features, rho, sparsity, snr, fdr, n_bootstraps, seed=seed
            )
            for seed in seed_list
        )
        fdps_mx = []
        fdps_pval = []
        fdps_eval = []
        powers_mx = []
        powers_pval = []
        powers_eval = []
        for fdp_mx, fdp_pval, fdp_eval, power_mx, power_pval, power_eval in results:
            fdps_mx.append(fdp_mx)
            fdps_pval.append(fdp_pval)
            fdps_eval.append(fdp_eval)
            powers_mx.append(power_mx)
            powers_pval.append(power_pval)
            powers_eval.append(power_eval)

        # Plot FDP and power distributions
        fdps = [fdps_mx, fdps_pval, fdps_eval]
        powers = [powers_mx, powers_pval, powers_eval]

        plot_results(fdps, fdr, n_samples, n_features)
        plot_results(powers, fdr, n_samples, n_features, power=True)
        plt.show()
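Before running the comparison, it may help to see what "quantile aggregation"
means concretely. Each bootstrap yields one p-value per variable, and
:footcite:t:`pmlr-v119-nguyen20a` combine them through a rescaled quantile.
The sketch below shows the non-adaptive rule for a fixed level ``gamma``;
``quantile_aggregation_sketch`` is an illustrative function, not the hidimstat
API, and ``adaptive_aggregation=True`` additionally optimizes ``gamma`` over a
grid (with a correction factor to preserve FDR control).

.. code-block:: Python

    def quantile_aggregation_sketch(pvals, gamma=0.3):
        # pvals: array of shape (n_bootstraps, n_features), one p-value
        # per variable and per knockoff draw
        # take the gamma-quantile across bootstraps, rescaled by 1 / gamma
        # so that the aggregate remains a valid p-value
        aggregated = np.quantile(pvals, gamma, axis=0) / gamma
        return np.minimum(1.0, aggregated)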
.. GENERATED FROM PYTHON SOURCE LINES 199-201

Aggregation methods provide a more stable inference
----------------------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 201-203

.. code-block:: Python

    effect_number_samples(n_samples=n_samples)

.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /auto_examples/images/sphx_glr_plot_knockoff_aggregation_001.png
         :alt: FDR = 0.1, n = 200, p = 150
         :srcset: /auto_examples/images/sphx_glr_plot_knockoff_aggregation_001.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_plot_knockoff_aggregation_002.png
         :alt: FDR = 0.1, n = 200, p = 150
         :srcset: /auto_examples/images/sphx_glr_plot_knockoff_aggregation_002.png
         :class: sphx-glr-multi-img

.. GENERATED FROM PYTHON SOURCE LINES 204-208

Repeating the model-X Knockoffs procedure reveals the instability of its
inference: the FDP and the power fluctuate from run to run. Both aggregation
methods are visibly more stable. However, e-value aggregation is more
conservative, i.e. some of the truly important variables are not recovered.

.. GENERATED FROM PYTHON SOURCE LINES 210-212

Limitations of the aggregation methods
--------------------------------------

.. GENERATED FROM PYTHON SOURCE LINES 212-214

.. code-block:: Python

    effect_number_samples(n_samples=50)

.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /auto_examples/images/sphx_glr_plot_knockoff_aggregation_003.png
         :alt: FDR = 0.1, n = 50, p = 150
         :srcset: /auto_examples/images/sphx_glr_plot_knockoff_aggregation_003.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/images/sphx_glr_plot_knockoff_aggregation_004.png
         :alt: FDR = 0.1, n = 50, p = 150
         :srcset: /auto_examples/images/sphx_glr_plot_knockoff_aggregation_004.png
         :class: sphx-glr-multi-img

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/.venv/lib/python3.13/site-packages/sklearn/linear_model/_coordinate_descent.py:695: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.994e-01, tolerance: 1.456e-01
      model = cd_fast.enet_coordinate_descent(
    /home/circleci/project/.venv/lib/python3.13/site-packages/sklearn/linear_model/_coordinate_descent.py:695: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 2.031e-01, tolerance: 1.456e-01
      model = cd_fast.enet_coordinate_descent(

.. GENERATED FROM PYTHON SOURCE LINES 215-217

An important caveat is that these methods require enough samples to estimate
the distribution of each feature; here, with n = 50 samples for p = 150
features, this estimation becomes unreliable (note the convergence warnings).

.. GENERATED FROM PYTHON SOURCE LINES 219-222

References
----------

.. footbibliography::


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (1 minutes 18.218 seconds)


.. _sphx_glr_download_auto_examples_plot_knockoff_aggregation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_knockoff_aggregation.ipynb <plot_knockoff_aggregation.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_knockoff_aggregation.py <plot_knockoff_aggregation.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_knockoff_aggregation.zip <plot_knockoff_aggregation.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_