Variable Importance on diabetes dataset#

Variable importance estimates the influence of a given input variable on the predictions made by a model. To assess variable importance in a prediction problem, Breiman [1] introduced the permutation approach, where the values of one variable/column are shuffled at a time. This permutation breaks the relationship between the variable of interest and the outcome. The loss score is then compared before and after the substitution: a significant drop in performance reflects the importance of that variable for predicting the outcome. Strobl et al. [2] showed that this easy-to-use solution is affected by the degree of correlation between the variables: it is biased towards truly non-significant variables that are highly correlated with significant ones, thereby creating falsely significant variables. They introduced a solution for the Random Forest estimator based on conditional sampling, performing permutations within sub-groups obtained by bisecting the space with the conditioning variables of the tree-building process. However, this solution is specific to Random Forests and is costly in high-dimensional settings. Chamma et al. [3] introduced a model-agnostic solution that bypasses the limitations of the permutation approach through a conditional scheme. The variable of interest carries two types of information: (1) its relationship with the remaining variables and (2) its relationship with the outcome. The standard permutation, while breaking the relationship with the outcome, also destroys the dependency with the remaining variables. Therefore, instead of directly permuting the variable of interest, the variable is predicted from the remaining variables, and the residuals of this prediction are permuted before reconstructing a new version of the variable. This preserves the dependency with the remaining variables.
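
To make the conditional scheme concrete, here is a minimal sketch of the residual-permutation idea described above. It is not hidimstat's internal implementation; the function name, the RidgeCV imputation model, and the use of a NumPy random generator are illustrative choices.

import numpy as np
from sklearn.linear_model import RidgeCV


def conditionally_permute_column(X, j, rng):
    # Predict column j from the remaining columns.
    X_minus_j = np.delete(X, j, axis=1)
    imputer = RidgeCV().fit(X_minus_j, X[:, j])
    x_j_hat = imputer.predict(X_minus_j)
    # Permute only the residuals, then reconstruct the column: the part
    # explained by the other variables (x_j_hat) is kept, so the dependency
    # with the remaining variables is preserved while the leftover
    # association with the outcome is broken.
    residuals = X[:, j] - x_j_hat
    X_new = X.copy()
    X_new[:, j] = x_j_hat + rng.permutation(residuals)
    return X_new

For example, conditionally_permute_column(X, 1, np.random.default_rng(0)) returns a copy of X whose second column has been conditionally permuted.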

In this example, we compare the standard permutation approach and its conditional variant for variable importance on the diabetes dataset, in the single-level case. The aim is to see whether the statistically controlled conditional solution changes the results.

References#

[1] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

[2] Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., & Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9, 307.

[3] Chamma, A., Engemann, D. A., & Thirion, B. (2023). Statistically valid variable importance assessment through conditional permutations. Advances in Neural Information Processing Systems 36 (NeurIPS 2023).

Load the diabetes dataset#

from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# Encode sex as binary
X[:, 1] = (X[:, 1] > 0.0).astype(int)
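
As a quick sanity check, scikit-learn's load_diabetes ships 442 samples and 10 standardized features, so shapes and variable names can be inspected directly:

print(X.shape)  # (442, 10)
print(diabetes.feature_names)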

Fit a baseline model on the diabetes dataset#

We fit a RidgeCV regression model (with an inner cross-validation for the regularization strength) on each training split of an outer 5-fold cross-validation of the diabetes dataset.

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegressionCV, RidgeCV
from sklearn.metrics import r2_score, root_mean_squared_error
from sklearn.model_selection import KFold

n_folds = 5
regressor = RidgeCV(
    alphas=np.logspace(-3, 3, 10),
    cv=KFold(shuffle=True, random_state=20),
)
regressor_list = [clone(regressor) for _ in range(n_folds)]
kf = KFold(n_splits=n_folds, shuffle=True, random_state=21)
for i, (train_index, test_index) in enumerate(kf.split(X)):
    regressor_list[i].fit(X[train_index], y[train_index])
    score = r2_score(
        y_true=y[test_index], y_pred=regressor_list[i].predict(X[test_index])
    )
    rmse = root_mean_squared_error(
        y_true=y[test_index], y_pred=regressor_list[i].predict(X[test_index])
    )

    print(f"Fold {i}: {score=}")
    print(f"Fold {i}: {rmse=}")
Fold 0: score=0.42498602106170535
Fold 0: rmse=56.42055334073377
Fold 1: score=0.4576508881883167
Fold 1: rmse=58.423308135994034
Fold 2: score=0.5351768182569143
Fold 2: rmse=54.730052311563774
Fold 3: score=0.48890245901647933
Fold 3: rmse=53.411951707073335
Fold 4: score=0.5377531849343379
Fold 4: rmse=51.2067043978449

Measure the importance of variables using the CFI method#

from hidimstat import CFI

cfi_importance_list = []
kf = KFold(n_splits=n_folds, shuffle=True, random_state=21)
for i, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold {i}")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    cfi = CFI(
        estimator=regressor_list[i],
        imputation_model_continuous=RidgeCV(alphas=np.logspace(-3, 3, 10), cv=KFold()),
        imputation_model_categorical=LogisticRegressionCV(
            Cs=np.logspace(-2, 2, 10),
            cv=KFold(),
        ),
        # covariate_estimator=HistGradientBoostingRegressor(random_state=0,),
        n_permutations=50,
        random_state=24,
        n_jobs=4,
    )
    cfi.fit(X_train, y_train)
    importance = cfi.importance(X_test, y_test)
    cfi_importance_list.append(importance)
Fold 0
Fold 1
Fold 2
Fold 3
Fold 4
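
Each call to cfi.importance(X_test, y_test) returns a dictionary whose "importance" entry holds one score per variable; this is the key used in the analysis below. As a quick usage example, the cross-fold average can be inspected directly:

cfi_mean = np.mean([x["importance"] for x in cfi_importance_list], axis=0)
for name, imp in zip(diabetes.feature_names, cfi_mean):
    print(f"{name}: {imp:.3f}")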

Measure the importance of variables using the LOCO method#

from hidimstat import LOCO

loco_importance_list = []
kf = KFold(n_splits=n_folds, shuffle=True, random_state=21)
for i, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold {i}")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    loco = LOCO(
        estimator=regressor_list[i],
        n_jobs=4,
    )
    loco.fit(X_train, y_train)
    importance = loco.importance(X_test, y_test)
    loco_importance_list.append(importance)
Fold 0
Fold 1
Fold 2
Fold 3
Fold 4
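
LOCO (Leave-One-Covariate-Out) removes one variable at a time, refits the model on the remaining variables, and scores the loss increase on held-out data. The snippet below is a hand-rolled sketch of that idea, not hidimstat's implementation; it assumes a squared-error loss and reuses the train/test split of the last fold from the loop above.

from sklearn.metrics import mean_squared_error

loss_full = mean_squared_error(y_test, regressor_list[-1].predict(X_test))
for j, name in enumerate(diabetes.feature_names):
    # Refit a clone of the baseline model without column j.
    submodel = clone(regressor).fit(np.delete(X_train, j, axis=1), y_train)
    loss_sub = mean_squared_error(
        y_test, submodel.predict(np.delete(X_test, j, axis=1))
    )
    print(f"{name}: {loss_sub - loss_full:.3f}")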

Measure the importance of variables using the permutation method#

from hidimstat import PFI

pfi_importance_list = []
kf = KFold(n_splits=n_folds, shuffle=True, random_state=21)
for i, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold {i}")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    pfi = PFI(
        estimator=regressor_list[i],
        n_permutations=50,
        random_state=25,
        n_jobs=4,
    )
    pfi.fit(X_train, y_train)
    importance = pfi.importance(X_test, y_test)
    pfi_importance_list.append(importance)
Fold 0
Fold 1
Fold 2
Fold 3
Fold 4
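
For comparison, the standard permutation scheme fits in a few lines: shuffle one column of the held-out data at a time and record the loss increase. Again, this is a sketch on the last fold with a squared-error loss, not hidimstat's implementation.

from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(25)
loss_ref = mean_squared_error(y_test, regressor_list[-1].predict(X_test))
for j, name in enumerate(diabetes.feature_names):
    X_perm = X_test.copy()
    # Shuffling the column breaks both its link with the outcome and its
    # dependency with the other variables (which the conditional scheme keeps).
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    loss_perm = mean_squared_error(y_test, regressor_list[-1].predict(X_perm))
    print(f"{name}: {loss_perm - loss_ref:.3f}")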

Analyze the results#

For each method, we stack the per-fold importance scores and, for each variable, test whether its mean importance across folds is greater than zero with a one-sample t-test. The p-values are then plotted on a negative log scale.

import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import ttest_1samp

# The CFI scores are halved here: with a squared loss, permutation-style
# importance is roughly twice the leave-one-covariate-out importance, which
# puts the methods on a comparable scale.
cfi_vim_arr = np.array([x["importance"] for x in cfi_importance_list]) / 2
cfi_pval = ttest_1samp(cfi_vim_arr, 0, alternative="greater").pvalue

vim = [
    pd.DataFrame(
        {
            "var": np.arange(cfi_vim_arr.shape[1]),
            "importance": x["importance"],
            "fold": i,
            "pval": cfi_pval,
            "method": "CFI",
        }
    )
    for i, x in enumerate(cfi_importance_list)
]

loco_vim_arr = np.array([x["importance"] for x in loco_importance_list])
loco_pval = ttest_1samp(loco_vim_arr, 0, alternative="greater").pvalue

vim += [
    pd.DataFrame(
        {
            "var": np.arange(loco_vim_arr.shape[1]),
            "importance": x["importance"],
            "fold": i,
            "pval": loco_pval,
            "method": "LOCO",
        }
    )
    for i, x in enumerate(loco_importance_list)
]

pfi_vim_arr = np.array([x["importance"] for x in pfi_importance_list])
pfi_pval = ttest_1samp(pfi_vim_arr, 0, alternative="greater").pvalue

vim += [
    pd.DataFrame(
        {
            "var": np.arange(pfi_vim_arr.shape[1]),
            "importance": x["importance"],
            "fold": i,
            "pval": pfi_pval,
            "method": "PFI",
        }
    )
    for i, x in enumerate(pfi_importance_list)
]

fig, ax = plt.subplots()
df_plot = pd.concat(vim)
df_plot["pval"] = -np.log10(df_plot["pval"])
methods = df_plot["method"].unique()
colors = plt.get_cmap("tab10", 10)

for i, method in enumerate(methods):
    subset = df_plot[df_plot["method"] == method]
    ax.bar(
        subset["var"] + i * 0.2,
        subset["pval"],
        width=0.2,
        label=method,
        color=colors(i),
    )

ax.legend(title="Method")
ax.set_ylabel(r"$-\log_{10}(\text{p-value})$")
ax.axhline(-np.log10(0.05), color="tab:red", ls="--")
ax.set_xlabel("Variable")
ax.set_xticks(range(len(diabetes.feature_names)))
ax.set_xticklabels(diabetes.feature_names)
plt.show()
[Figure: grouped bar chart of -log10(p-value) per variable for the CFI, LOCO, and PFI methods, with the 0.05 significance threshold shown as a dashed red line]

Total running time of the script: (0 minutes 33.766 seconds)
