EnCluDL
- class hidimstat.EnCluDL(desparsified_lasso, clustering, n_bootstraps=25, cluster_boostrap_size=0.3, bootstrap_groups=None, n_jobs=1, random_state=None, memory=None, ensembling_method='quantiles', gamma=0.5, adaptive_aggregation=False)
Bases: BaseVariableImportance
Ensemble clustered inference with desparsified lasso. Performs multiple runs of clustered inference, each using a different clustering obtained from a random subsample of the data. The results are then aggregated to provide robust feature importance scores and p-values. This algorithm is based on the method described in [1].
- Parameters:
- desparsified_lasso: DesparsifiedLasso
An instance of the DesparsifiedLasso class for statistical inference.
- clustering: sklearn.cluster.FeatureAgglomeration
An instance of a clustering method that operates on features.
- n_bootstraps: int, optional (default=25)
Number of bootstrap iterations for ensemble inference.
- cluster_boostrap_size: float, optional (default=0.3)
Fraction of samples used for computing the clustering. When cluster_boostrap_size=1.0, all samples are used.
- bootstrap_groups: ndarray, shape (n_samples,), optional (default=None)
Sample group labels for stratified subsampling.
- n_jobs: int or None, optional (default=1)
Number of parallel jobs.
- random_state: int, optional (default=None)
Random seed for reproducible subsampling.
- memory: joblib.Memory or str, optional (default=None)
Used to cache the output of the clustering and inference computation. By default, no caching is done. If provided, it should be the path to the caching directory or a joblib.Memory object.
- ensembling_method: str, optional (default=’quantiles’)
Method used for ensembling. Currently, the two available methods are ‘quantiles’ and ‘median’.
- gamma: float, optional (default=0.5)
Lowest gamma-quantile considered to compute the adaptive quantile aggregation formula. This parameter is used only if ensembling_method is ‘quantiles’.
- adaptive_aggregation: bool, optional (default=False)
Whether to use adaptive quantile aggregation when ensembling_method is ‘quantiles’.
- Attributes:
- clustering_desparsified_lassos_: list of DesparsifiedLasso
List of fitted CluDL estimators from each bootstrap.
- importances_: ndarray, shape (n_features,) or (n_features, n_tasks)
Estimated coefficients at feature level.
- pvalues_: ndarray, shape (n_features,)
P-values for each feature.
- __init__(desparsified_lasso, clustering, n_bootstraps=25, cluster_boostrap_size=0.3, bootstrap_groups=None, n_jobs=1, random_state=None, memory=None, ensembling_method='quantiles', gamma=0.5, adaptive_aggregation=False)
- fit(X, y)
Fit multiple clustered inferences on random subsamples of the data.
- Parameters:
- X: ndarray, shape (n_samples, n_features)
Input data matrix.
- y: ndarray, shape (n_samples,) or (n_samples, n_tasks)
Target variable(s).
- Returns:
- self: EnCluDL
Fitted estimator.
- importance(X=None, y=None)
Compute feature importance by aggregating results from multiple clustered inferences.
- Parameters:
- X
Not used, present for API consistency by convention.
- y
Not used, present for API consistency by convention.
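The aggregation step is controlled by ensembling_method, gamma and adaptive_aggregation. The sketch below illustrates the standard quantile aggregation of p-values for a single feature. It assumes that the 'quantiles' option follows this textbook formula; the helper names and the linear-interpolation quantile are illustrative choices, not the library's code.

```python
import math


def empirical_quantile(values, q):
    """Linearly interpolated q-quantile (0 <= q <= 1) of a non-empty list."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = min(math.ceil(position), len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def fixed_quantile_aggregation(pvalues, gamma=0.5):
    """Aggregate the p-values of one feature with a fixed gamma-quantile."""
    return min(1.0, empirical_quantile(pvalues, gamma) / gamma)


def adaptive_quantile_aggregation(pvalues, gamma_min=0.5, n_grid=50):
    """Search gamma on a grid over [gamma_min, 1], then apply the
    (1 - log(gamma_min)) correction that pays for the search."""
    gammas = [gamma_min + (1 - gamma_min) * i / (n_grid - 1) for i in range(n_grid)]
    best = min(empirical_quantile(pvalues, g) / g for g in gammas)
    return min(1.0, (1 - math.log(gamma_min)) * best)


# Hypothetical p-values of a single feature across 5 bootstrap runs.
pvals = [0.01, 0.02, 0.03, 0.20, 0.50]
print(fixed_quantile_aggregation(pvals, gamma=0.5))        # twice the median
print(adaptive_quantile_aggregation(pvals, gamma_min=0.5))
```

With gamma=0.5 the fixed rule reduces to twice the median p-value. The adaptive variant is never smaller than the best fixed-gamma value on the grid, because of the correction factor.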
- fit_importance(X, y)
Fit the model and compute feature importance.
- Parameters:
- X: ndarray, shape (n_samples, n_features)
Input data matrix.
- y: ndarray, shape (n_samples,) or (n_samples, n_tasks)
Target variable(s).
- Returns:
- importances_: ndarray, shape (n_features,) or (n_features, n_tasks)
Estimated coefficients at feature level.
- fdr_selection(fdr, fdr_control='bhq', reshaping_function=None, two_tailed_test=True)
Overrides the signature to set two_tailed_test=True by default.
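For orientation, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, assuming that fdr_control='bhq' refers to it. The function name is hypothetical, and the sketch ignores reshaping_function and the sign handling implied by two_tailed_test.

```python
def benjamini_hochberg(pvalues, fdr=0.1):
    """Return a list of booleans: True where the BH procedure rejects."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) such that p_(k) <= k * fdr / n.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= rank * fdr / n:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    selected = [False] * n
    for index in order[:cutoff_rank]:
        selected[index] = True
    return selected


# Hypothetical p-values for five features.
pvals = [0.001, 0.30, 0.012, 0.04, 0.75]
print(benjamini_hochberg(pvals, fdr=0.1))  # [True, False, True, True, False]
```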
- fwer_selection(fwer, procedure='bonferroni', n_tests=None, two_tailed_test=False)
Performs feature selection based on Family-Wise Error Rate (FWER) control.
- Parameters:
- fwer: float
The target family-wise error rate level (between 0 and 1).
- procedure: {‘bonferroni’}, default=’bonferroni’
The FWER control method to use. Currently only ‘bonferroni’ (Bonferroni correction) is available.
- n_tests: int or None, default=None
Factor for multiple testing correction. If None, uses the number of clusters if available, otherwise the number of features.
- two_tailed_test: bool, default=False
If True, uses the sign of the importance scores to indicate whether the selected features have positive or negative effects.
- Returns:
- selected: ndarray of int
Integer array indicating the selected features. 1 indicates selected features with positive effects, -1 indicates selected features with negative effects, 0 indicates non-selected features.
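The following sketch shows how a Bonferroni threshold and the signed output described above could fit together. It is an illustration under assumptions: the function name is hypothetical, the fallback for n_tests uses the number of features only, and the library may treat the two-tailed threshold differently.

```python
def bonferroni_selection(pvalues, importances, fwer=0.05, n_tests=None,
                         two_tailed_test=False):
    """Return 1 / -1 / 0 per feature, following the documented encoding."""
    if n_tests is None:
        # Assumption: fall back to the number of features.
        n_tests = len(pvalues)
    threshold = fwer / n_tests  # Bonferroni-corrected level
    selected = []
    for pvalue, importance in zip(pvalues, importances):
        if pvalue >= threshold:
            selected.append(0)   # not selected
        elif two_tailed_test and importance < 0:
            selected.append(-1)  # selected, negative effect
        else:
            selected.append(1)   # selected, positive effect (or one-tailed)
    return selected


# Hypothetical p-values and importance scores for four features.
pvals = [0.001, 0.2, 0.004, 0.03]
scores = [1.5, 0.1, -2.0, 0.4]
print(bonferroni_selection(pvals, scores, fwer=0.05, two_tailed_test=True))
# [1, 0, -1, 0]
```

Note that the last feature has p = 0.03, which is below 0.05 but above the corrected level 0.05 / 4 = 0.0125, so it is not selected.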
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routing: MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params: dict
Parameter names mapped to their values.
- importance_selection(k_best=None, percentile=None, threshold_max=None, threshold_min=None)
Selects features based on variable importance.
- Parameters:
- k_best: int, default=None
Selects the top k features based on importance scores.
- percentile: float, default=None
Selects features based on a specified percentile of importance scores.
- threshold_max: float, default=None
Selects features with importance scores below the specified maximum threshold.
- threshold_min: float, default=None
Selects features with importance scores above the specified minimum threshold.
- Returns:
- selection: array-like of shape (n_features,)
Binary array indicating the selected features.
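A minimal sketch of how the k_best and threshold criteria could combine into a boolean mask. It assumes that the criteria are intersected and that thresholds are strict; the percentile option is left out because its exact semantics are not specified here. The function name is hypothetical.

```python
def importance_selection_sketch(importances, k_best=None,
                                threshold_max=None, threshold_min=None):
    """Return a boolean mask over features (True means selected)."""
    n_features = len(importances)
    selected = [True] * n_features
    if k_best is not None:
        # Keep the k features with the largest importance scores.
        ranked = sorted(range(n_features), key=lambda i: importances[i],
                        reverse=True)
        top = set(ranked[:k_best])
        selected = [i in top for i in range(n_features)]
    if threshold_max is not None:
        selected = [s and score < threshold_max
                    for s, score in zip(selected, importances)]
    if threshold_min is not None:
        selected = [s and score > threshold_min
                    for s, score in zip(selected, importances)]
    return selected


# Hypothetical importance scores for four features.
scores = [0.9, 0.1, 0.5, 0.3]
print(importance_selection_sketch(scores, k_best=2))
# [True, False, True, False]
print(importance_selection_sketch(scores, threshold_min=0.2, threshold_max=0.8))
# [False, False, True, True]
```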
- plot_importance(ax=None, ascending=False, feature_names=None, **seaborn_barplot_kwargs)
Plot feature importances as a horizontal bar plot.
- Parameters:
- ax: matplotlib.axes.Axes or None (default=None)
Axes object to draw the plot onto, otherwise uses the current Axes.
- ascending: bool, default=False
Whether to sort features by ascending importance.
- **seaborn_barplot_kwargs: additional keyword arguments
Additional arguments passed to seaborn.barplot. https://seaborn.pydata.org/generated/seaborn.barplot.html
- Returns:
- ax: matplotlib.axes.Axes
The Axes object with the plot.
- pvalue_selection(k_lowest=None, percentile=None, threshold_max=0.05, threshold_min=None, alternative_hypothesis=False)
Selects features based on p-values.
- Parameters:
- k_lowest: int, default=None
Selects the k features with lowest p-values.
- percentile: float, default=None
Selects features based on a specified percentile of p-values.
- threshold_max: float, default=0.05
Selects features with p-values below the specified maximum threshold (0 to 1).
- threshold_min: float, default=None
Selects features with p-values above the specified minimum threshold (0 to 1).
- alternative_hypothesis: bool, default=False
If True, selects based on 1-pvalues instead of p-values.
- Returns:
- selection: array-like of shape (n_features,)
Binary array indicating the selected features (True for selected).
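To make the effect of alternative_hypothesis concrete, the sketch below applies threshold_max either to the p-values or to 1 - p. It is a simplified illustration with a hypothetical function name; it assumes a strict comparison and omits k_lowest, percentile and threshold_min.

```python
def pvalue_selection_sketch(pvalues, threshold_max=0.05,
                            alternative_hypothesis=False):
    """Return a boolean mask over features (True means selected)."""
    if alternative_hypothesis:
        # Select on 1 - p instead of p, as described for this flag.
        tested = [1 - p for p in pvalues]
    else:
        tested = list(pvalues)
    return [value < threshold_max for value in tested]


# Hypothetical p-values for four features.
pvals = [0.01, 0.04, 0.2, 0.97]
print(pvalue_selection_sketch(pvals))
# [True, True, False, False]
print(pvalue_selection_sketch(pvals, alternative_hypothesis=True))
# [False, False, False, True]
```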
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params: dict
Estimator parameters.
- Returns:
- self: estimator instance
Estimator instance.
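The <component>__<parameter> convention is easiest to see on a scikit-learn Pipeline. The example below uses only scikit-learn objects; the step names "scale" and "clf" are arbitrary labels chosen for illustration. The same convention would presumably apply to the estimators nested inside EnCluDL, although that is not shown here.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A nested object: each step is itself an estimator with its own parameters.
pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Update parameters of the nested steps through the double-underscore syntax.
pipeline.set_params(clf__C=10.0, scale__with_mean=False)

# deep=True exposes nested parameters; deep=False lists only the top level.
deep_params = pipeline.get_params(deep=True)
shallow_params = pipeline.get_params(deep=False)
print(deep_params["clf__C"])         # 10.0
print("clf__C" in shallow_params)    # False
```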