.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "generated/gallery/examples/plot_diabetes_variable_importance_example.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_generated_gallery_examples_plot_diabetes_variable_importance_example.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_generated_gallery_examples_plot_diabetes_variable_importance_example.py:


Feature Importance on diabetes dataset using cross-validation
=============================================================

In this example, we show how to compute variable importance using Permutation
Feature Importance (PFI), Leave-One-Covariate-Out (LOCO), and Conditional
Feature Importance (CFI) on the diabetes dataset. This example also shows how
to measure feature importance in a K-Fold cross-validation setting in order to
use all the available data.

.. GENERATED FROM PYTHON SOURCE LINES 13-16

Load the diabetes dataset
-------------------------
We start by loading the diabetes dataset from sklearn.

.. GENERATED FROM PYTHON SOURCE LINES 16-26

.. code-block:: Python

    from sklearn.datasets import load_diabetes

    diabetes = load_diabetes()
    X, y = diabetes.data, diabetes.target
    # Encode sex as binary
    X[:, 1] = (X[:, 1] > 0.0).astype(int)

    print(f"Number of samples: {X.shape[0]}, number of features: {X.shape[1]}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Number of samples: 442, number of features: 10

.. GENERATED FROM PYTHON SOURCE LINES 27-34

Fit a baseline model on the diabetes dataset
--------------------------------------------
The benefit of the perturbation-based variable importance methods presented in
this example is that they are model-agnostic. Therefore, we can use any
regression model.
We leverage this flexibility here by using a Voting Regressor that combines a
Ridge regression model, a Histogram Gradient Boosting model, a Random Forest
model, and a Lasso regression model.

.. GENERATED FROM PYTHON SOURCE LINES 34-67

.. code-block:: Python

    import numpy as np
    from sklearn.base import clone
    from sklearn.ensemble import (
        HistGradientBoostingRegressor,
        RandomForestRegressor,
        VotingRegressor,
    )
    from sklearn.linear_model import LassoCV, RidgeCV
    from sklearn.metrics import r2_score
    from sklearn.model_selection import KFold

    n_folds = 5
    cv = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    regressor = VotingRegressor(
        [
            ("ridge", RidgeCV()),
            ("hgb", HistGradientBoostingRegressor()),
            ("rf", RandomForestRegressor()),
            ("lasso", LassoCV()),
        ]
    )
    scores = []
    regressor_list = [clone(regressor) for _ in range(n_folds)]
    for i, (train_index, test_index) in enumerate(cv.split(X)):
        regressor_list[i].fit(X[train_index], y[train_index])
        scores.append(
            r2_score(y_true=y[test_index], y_pred=regressor_list[i].predict(X[test_index]))
        )

    print(f"R2 scores across folds: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
    regressor

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    R2 scores across folds: 0.479 ± 0.096

    VotingRegressor(estimators=[('ridge', RidgeCV()),
                                ('hgb', HistGradientBoostingRegressor()),
                                ('rf', RandomForestRegressor()),
                                ('lasso', LassoCV())])
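Before handing the fitted models over to hidimstat, it helps to recall what
Permutation Feature Importance actually measures: the drop in held-out score
when one feature's values are shuffled, breaking its link with the target. The
following is a toy sketch of that idea on synthetic data with a plain
``LinearRegression`` (a simplified single-split illustration, not the
hidimstat implementation used below):

.. code-block:: Python

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    # shuffle=False keeps the 2 informative features in columns 0 and 1
    X_demo, y_demo = make_regression(
        n_samples=300, n_features=5, n_informative=2, shuffle=False, random_state=0
    )
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

    model = LinearRegression().fit(X_tr, y_tr)
    baseline = r2_score(y_te, model.predict(X_te))

    rng = np.random.RandomState(0)
    importances = []
    for j in range(X_te.shape[1]):
        X_perm = X_te.copy()
        # Shuffle column j only, severing its association with y
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        importances.append(baseline - r2_score(y_te, model.predict(X_perm)))

Permuting an informative column degrades the score noticeably, while permuting
a noise column leaves it essentially unchanged; this is the quantity that PFI
reports per feature.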


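LOCO takes a different route: instead of permuting a feature, it refits the
model without that feature and compares held-out scores. Again a toy sketch on
synthetic data with a plain ``LinearRegression`` (an illustration of the
principle, not the hidimstat implementation):

.. code-block:: Python

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    # shuffle=False keeps the 2 informative features in columns 0 and 1
    X_demo, y_demo = make_regression(
        n_samples=300, n_features=5, n_informative=2, shuffle=False, random_state=0
    )
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

    full_model = LinearRegression().fit(X_tr, y_tr)
    full_score = r2_score(y_te, full_model.predict(X_te))

    loco = []
    for j in range(X_demo.shape[1]):
        keep = [k for k in range(X_demo.shape[1]) if k != j]
        # Refit from scratch without feature j
        sub_model = LinearRegression().fit(X_tr[:, keep], y_tr)
        loco.append(full_score - r2_score(y_te, sub_model.predict(X_te[:, keep])))

The refitting step is what gives LOCO its robustness to correlated features,
and also what makes it more expensive: one model fit per candidate feature.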
.. GENERATED FROM PYTHON SOURCE LINES 68-79

Measure the importance of variables
-----------------------------------
We now measure the importance of each variable using three different methods:
Conditional Feature Importance (CFI), Leave-One-Covariate-Out (LOCO), and
Permutation Feature Importance (PFI). We use the K-Fold cross-validation
scheme to leverage all the available data. However, this comes with the
challenge that the test statistics computed across folds are not independent,
since overlapping training sets are used to fit the models. To address this
issue, we use the Nadeau-Bengio corrected t-test
:footcite:t:`nadeau1999inference`, which adjusts the variance estimate to
account for the dependency between the test statistics. We use the `n_jobs`
parameter to parallelize the computation across folds.

.. GENERATED FROM PYTHON SOURCE LINES 79-92

.. code-block:: Python

    from hidimstat import CFICV

    cfi_cv = CFICV(
        estimators=regressor_list,
        cv=cv,
        n_jobs=5,
        statistical_test="nb-ttest",
        random_state=0,
    )

    importances_cfi = cfi_cv.fit_importance(X, y)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Fitting importance estimators for each fold:   0%|          | 0/5 [00:00<?, ?it/s]

.. GENERATED FROM PYTHON SOURCE LINES 164-169

Several trends can be observed from the results: PFI tends to give smaller
p-values (that is, higher bars in the plot) than LOCO and CFI. This is
expected, since PFI is known to overestimate the importance of correlated
variables. On the other hand, LOCO generally yields larger p-values (smaller
bars in the plot). This is also a known trend, since LOCO tends to suffer from
lower statistical power.

.. GENERATED FROM PYTHON SOURCE LINES 172-175

References
----------
.. footbibliography::

.. rst-class:: sphx-glr-timing

    **Total running time of the script:** (0 minutes 36.147 seconds)

**Estimated memory usage:** 1091 MB


.. _sphx_glr_download_generated_gallery_examples_plot_diabetes_variable_importance_example.py:

.. only:: html
    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: plot_diabetes_variable_importance_example.ipynb <plot_diabetes_variable_importance_example.ipynb>`

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: plot_diabetes_variable_importance_example.py <plot_diabetes_variable_importance_example.py>`

        .. container:: sphx-glr-download sphx-glr-download-zip

            :download:`Download zipped: plot_diabetes_variable_importance_example.zip <plot_diabetes_variable_importance_example.zip>`

.. only:: html

    .. rst-class:: sphx-glr-signature

        `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_