.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "generated/gallery/examples/plot_diabetes_variable_importance_example.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_generated_gallery_examples_plot_diabetes_variable_importance_example.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_generated_gallery_examples_plot_diabetes_variable_importance_example.py:


Feature Importance on diabetes dataset using cross-validation
=============================================================

In this example, we show how to compute variable importance using Permutation
Feature Importance (PFI), Leave-One-Covariate-Out (LOCO), and Conditional
Feature Importance (CFI) on the diabetes dataset. This example also shows how
to measure feature importance in a K-Fold cross-validation setting in order to
use all the available data.

.. GENERATED FROM PYTHON SOURCE LINES 13-16

Load the diabetes dataset
-------------------------
We start by loading the diabetes dataset from sklearn.

.. GENERATED FROM PYTHON SOURCE LINES 16-26

.. code-block:: Python

    from sklearn.datasets import load_diabetes

    diabetes = load_diabetes()
    X, y = diabetes.data, diabetes.target
    # Encode sex as binary
    X[:, 1] = (X[:, 1] > 0.0).astype(int)

    print(f"Number of samples: {X.shape[0]}, number of features: {X.shape[1]}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Number of samples: 442, number of features: 10

.. GENERATED FROM PYTHON SOURCE LINES 27-34

Fit a baseline model on the diabetes dataset
--------------------------------------------
The benefit of the perturbation-based variable importance methods presented in
this example is that they are model-agnostic. Therefore, we can use any
regression model.
We leverage this flexibility here by using a Voting Regressor that combines a
Ridge regression model, a Histogram Gradient Boosting model, a Random Forest
model, and a Lasso regression model.

.. GENERATED FROM PYTHON SOURCE LINES 34-67

.. code-block:: Python

    import numpy as np
    from sklearn.base import clone
    from sklearn.ensemble import (
        HistGradientBoostingRegressor,
        RandomForestRegressor,
        VotingRegressor,
    )
    from sklearn.linear_model import LassoCV, RidgeCV
    from sklearn.metrics import r2_score
    from sklearn.model_selection import KFold

    n_folds = 5
    cv = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    regressor = VotingRegressor(
        [
            ("ridge", RidgeCV()),
            ("hgb", HistGradientBoostingRegressor()),
            ("rf", RandomForestRegressor()),
            ("lasso", LassoCV()),
        ]
    )
    scores = []
    regressor_list = [clone(regressor) for _ in range(n_folds)]
    for i, (train_index, test_index) in enumerate(cv.split(X)):
        regressor_list[i].fit(X[train_index], y[train_index])
        scores.append(
            r2_score(y_true=y[test_index], y_pred=regressor_list[i].predict(X[test_index]))
        )

    print(f"R2 scores across folds: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
    regressor

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    R2 scores across folds: 0.479 ± 0.096

    VotingRegressor(estimators=[('ridge', RidgeCV()),
                                ('hgb', HistGradientBoostingRegressor()),
                                ('rf', RandomForestRegressor()),
                                ('lasso', LassoCV())])
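Before handing the fitted models over to hidimstat, it helps to recall what
Permutation Feature Importance actually measures: the drop in held-out score
when one feature's values are shuffled, breaking its link with the target. The
following is a toy sketch of that idea on synthetic data with a plain
``LinearRegression`` (a simplified single-split illustration, not the
hidimstat implementation used below):

.. code-block:: Python

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    # shuffle=False keeps the 2 informative features in columns 0 and 1
    X_demo, y_demo = make_regression(
        n_samples=300, n_features=5, n_informative=2, shuffle=False, random_state=0
    )
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

    model = LinearRegression().fit(X_tr, y_tr)
    baseline = r2_score(y_te, model.predict(X_te))

    rng = np.random.RandomState(0)
    importances = []
    for j in range(X_te.shape[1]):
        X_perm = X_te.copy()
        # Shuffle column j only, severing its association with y
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        importances.append(baseline - r2_score(y_te, model.predict(X_perm)))

Permuting an informative column degrades the score noticeably, while permuting
a noise column leaves it essentially unchanged; this is the quantity that PFI
reports per feature.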


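LOCO takes a different route: instead of permuting a feature, it refits the
model without that feature and compares held-out scores. Again a toy sketch on
synthetic data with a plain ``LinearRegression`` (an illustration of the
principle, not the hidimstat implementation):

.. code-block:: Python

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    # shuffle=False keeps the 2 informative features in columns 0 and 1
    X_demo, y_demo = make_regression(
        n_samples=300, n_features=5, n_informative=2, shuffle=False, random_state=0
    )
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

    full_model = LinearRegression().fit(X_tr, y_tr)
    full_score = r2_score(y_te, full_model.predict(X_te))

    loco = []
    for j in range(X_demo.shape[1]):
        keep = [k for k in range(X_demo.shape[1]) if k != j]
        # Refit from scratch without feature j
        sub_model = LinearRegression().fit(X_tr[:, keep], y_tr)
        loco.append(full_score - r2_score(y_te, sub_model.predict(X_te[:, keep])))

The refitting step is what gives LOCO its robustness to correlated features,
and also what makes it more expensive: one model fit per candidate feature.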
.. GENERATED FROM PYTHON SOURCE LINES 68-79

Measure the importance of variables
-----------------------------------
We now measure the importance of each variable using three different methods:
Conditional Feature Importance (CFI), Leave-One-Covariate-Out (LOCO), and
Permutation Feature Importance (PFI). We use the K-Fold cross-validation
scheme to leverage all the available data. However, this comes with the
challenge that the test statistics computed across folds are not independent,
since overlapping training sets are used to fit the models. To address this
issue, we use the Nadeau-Bengio corrected t-test
:footcite:t:`nadeau1999inference`, which adjusts the variance estimate to
account for the dependency between the test statistics. We use the `n_jobs`
parameter to parallelize the computation across folds.

.. GENERATED FROM PYTHON SOURCE LINES 79-92

.. code-block:: Python

    from hidimstat import CFICV

    cfi_cv = CFICV(
        estimators=regressor_list,
        cv=cv,
        n_jobs=5,
        statistical_test="nb-ttest",
        random_state=0,
    )

    importances_cfi = cfi_cv.fit_importance(X, y)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Fitting importance estimators for each fold:   0%|          | 0/5 [00:00<?, ?it/s]

.. GENERATED FROM PYTHON SOURCE LINES 164-169

Several trends can be observed from the results: PFI tends to give smaller
p-values (that is, higher bars in the plot) than LOCO and CFI. This is
expected, since PFI is known to overestimate the importance of correlated
variables. On the other hand, LOCO generally yields larger p-values (smaller
bars in the plot). This is also a known trend, since LOCO tends to suffer from
lower statistical power.

.. GENERATED FROM PYTHON SOURCE LINES 172-175

References
----------
.. footbibliography::

.. rst-class:: sphx-glr-timing

    **Total running time of the script:** (0 minutes 36.147 seconds)

**Estimated memory usage:** 1091 MB


.. _sphx_glr_download_generated_gallery_examples_plot_diabetes_variable_importance_example.py:

.. only:: html
    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: plot_diabetes_variable_importance_example.ipynb <plot_diabetes_variable_importance_example.ipynb>`

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: plot_diabetes_variable_importance_example.py <plot_diabetes_variable_importance_example.py>`

        .. container:: sphx-glr-download sphx-glr-download-zip

            :download:`Download zipped: plot_diabetes_variable_importance_example.zip <plot_diabetes_variable_importance_example.zip>`

.. only:: html

    .. rst-class:: sphx-glr-signature

        `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_