PFICV#
- class hidimstat.PFICV(estimators, cv, statistical_test='nb-ttest', method='predict', loss=<function mean_squared_error>, n_permutations=50, features_groups=None, random_state=None, n_jobs=1)[source]#
Bases:
BasePerturbationCVPermutation Feature Importance (PFI) algorithm with Cross-Validation.
- Parameters:
- estimators: list of sklearn estimators or single sklearn estimator
Can be a list of fitted sklearn estimators (one per fold) or a single sklearn estimator that will then be cloned and fitted on each fold.
- cv: cross-validation generator
A cross-validation generator object (e.g., KFold, StratifiedKFold).
- statistical_testcallable or str, default=”nb-ttest”
Statistical test function for computing p-values from importance scores.
- methodstr, default=”predict”
The method to use for the prediction. This determines the predictions passed to the loss function. Supported methods are “predict”, “predict_proba” or “decision_function”.
- losscallable, default=mean_squared_error
The loss function to use when comparing the perturbed model to the full model.
- n_permutationsint, default=50
The number of permutations to perform. For each variable/group of variables, the mean of the losses over the n_permutations is computed.
- features_groups: dict or None, default=None
A dictionary where the keys are the group names and the values are the list of column names corresponding to each features group. If None, the features_groups are identified based on the columns of X.
- random_stateint or None, default=None
The random state to use for sampling.
- n_jobsint, default=1
The number of jobs to run in parallel. Parallelization is done over the folds.
- Attributes:
- importance_estimators_list of PFI instances
The PFI instances fitted on each fold.
- importances_ndarray of shape (n_groups, n_folds)
The calculated importance scores for each feature group and each fold. Higher values indicate greater importance.
- pvalues_ndarray of shape (n_groups,)
The p-values for the importance scores computed across folds.
- estimators_list of sklearn estimators
List of fitted estimators for each fold.
- test_train_frac_float
Fraction of test samples over train samples in each fold. Approximated as 1 / (n_splits - 1).
- __init__(estimators, cv, statistical_test='nb-ttest', method='predict', loss=<function mean_squared_error>, n_permutations=50, features_groups=None, random_state=None, n_jobs=1)[source]#
- fdr_selection(fdr, fdr_control='bhq', reshaping_function=None, two_tailed_test=False)[source]#
Performs feature selection based on False Discovery Rate (FDR) control.
- Parameters:
- fdrfloat
The target false discovery rate level (between 0 and 1)
- fdr_control: {‘bhq’, ‘bhy’}, default=’bhq’
The FDR control method to use: - ‘bhq’: Benjamini-Hochberg procedure - ‘bhy’: Benjamini-Hochberg-Yekutieli procedure
- reshaping_function: callable or None, default=None
Optional reshaping function for FDR control methods. If None, defaults to sum of reciprocals for ‘bhy’.
- two_tailed_test: bool, default=False
If True, performs two-tailed test selection using both p-values for positive effects and one-minus p-values for negative effects. The sign of the effect is determined from the sign of the importance scores.
- Returns:
- selectedndarray of int
Integer array indicating the selected features. 1 indicates selected features with positive effects, -1 indicates selected features with negative effects, 0 indicates non-selected features.
- Raises:
- ValueError
If importances_ haven’t been computed yet
- AssertionError
If pvalues_ are missing or fdr_control is invalid
- fwer_selection(fwer, procedure='bonferroni', n_tests=None, two_tailed_test=False)[source]#
Performs feature selection based on Family-Wise Error Rate (FWER) control.
- Parameters:
- fwerfloat
The target family-wise error rate level (between 0 and 1)
- procedure{‘bonferroni’}, default=’bonferroni’
The FWER control method to use: - ‘bonferroni’: Bonferroni correction
- n_testsint or None, default=None
Factor for multiple testing correction. If None, uses the number of clusters or the number of features in this order.
- two_tailed_testbool, default=False
If True, uses the sign of the importance scores to indicate whether the selected features have positive or negative effects.
- Returns:
- selectedndarray of int
Integer array indicating the selected features. 1 indicates selected features with positive effects, -1 indicates selected features with negative effects, 0 indicates non-selected features.
- get_metadata_routing()[source]#
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routingMetadataRequest
A
MetadataRequestencapsulating routing information.
- get_params(deep=True)[source]#
Get parameters for this estimator.
- Parameters:
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- paramsdict
Parameter names mapped to their values.
- importance(X, y)[source]#
Compute the importance scores using cross-validation.
- Parameters:
- Xarray-like of shape (n_samples, n_features)
The input samples to compute importance scores for.
- yarray-like of shape (n_samples,)
- Returns:
- importances_ndarray of shape (n_features, n_groups)
The importance scores for each group of features.
- importance_selection(k_best=None, percentile=None, threshold_max=None, threshold_min=None)[source]#
Selects features based on variable importance.
- Parameters:
- k_bestint, default=None
Selects the top k features based on importance scores.
- percentilefloat, default=None
Selects features based on a specified percentile of importance scores.
- threshold_maxfloat, default=None
Selects features with importance scores below the specified maximum threshold.
- threshold_minfloat, default=None
Selects features with importance scores above the specified minimum threshold.
- Returns:
- selectionarray-like of shape (n_features,)
Binary array indicating the selected features.
- plot_importance(ax=None, ascending=False, feature_names=None, **seaborn_barplot_kwargs)[source]#
Plot feature importances as a horizontal bar plot.
- Parameters:
- axmatplotlib.axes.Axes or None, (default=None)
Axes object to draw the plot onto, otherwise uses the current Axes.
- ascending: bool, default=False
Whether to sort features by ascending importance.
- **seaborn_barplot_kwargsadditional keyword arguments
Additional arguments passed to seaborn.barplot. https://seaborn.pydata.org/generated/seaborn.barplot.html
- Returns:
- axmatplotlib.axes.Axes
The Axes object with the plot.
- pvalue_selection(k_lowest=None, percentile=None, threshold_max=0.05, threshold_min=None, alternative_hypothesis=False)[source]#
Selects features based on p-values.
- Parameters:
- k_lowestint, default=None
Selects the k features with lowest p-values.
- percentilefloat, default=None
Selects features based on a specified percentile of p-values.
- threshold_maxfloat, default=0.05
Selects features with p-values below the specified maximum threshold (0 to 1).
- threshold_minfloat, default=None
Selects features with p-values above the specified minimum threshold (0 to 1).
- alternative_hypothesisbool, default=False
If True, selects based on 1-pvalues instead of p-values.
- Returns:
- selectionarray-like of shape (n_features,)
Binary array indicating the selected features (True for selected).
- set_params(**params)[source]#
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as
Pipeline). The latter have parameters of the form<component>__<parameter>so that it’s possible to update each component of a nested object.- Parameters:
- **paramsdict
Estimator parameters.
- Returns:
- selfestimator instance
Estimator instance.
Examples using hidimstat.PFICV#
Feature Importance on diabetes dataset using cross-validation