hidimstat.D0CRT

class hidimstat.D0CRT(estimator, method: str = 'predict', estimated_coef=None, sigma_X=None, params_lasso_screening={'alpha': None, 'alpha_max_fraction': 0.5, 'alphas': None, 'cv': 5, 'fit_intercept': False, 'max_iter': 1000, 'n_alphas': 10, 'selection': 'cyclic', 'tol': 1e-06}, params_lasso_distillation_x=None, refit=False, screening=True, screening_threshold=10, centered=True, n_jobs=1, joblib_verbose=0, fit_y=False, scaled_statistics=False, random_state=2022)

Bases: BaseVariableImportance

Implements the distilled conditional randomization test (dCRT) without interactions.

This class provides a fast implementation of the Conditional Randomization Test (Candes et al. [1]) using the distillation process from Liu et al. [2]. The approach accelerates variable selection by combining Lasso-based screening with residual-based test statistics. The y-distillation is based on a given estimator, and the x-distillation is based on a Lasso estimator. Based on the original implementation at moleibobliu/Distillation-CRT.

Parameters:
estimator : sklearn estimator

The base estimator used for y-distillation and prediction (e.g., Lasso, RandomForest, …).

method : str, default="predict"

Method of the estimator to use for predictions ("predict", "predict_proba", etc.).

estimated_coef : array-like of shape (n_features,) or None, default=None

Pre-computed feature coefficients. If None, coefficients are estimated via Lasso.

sigma_X : array-like of shape (n_features, n_features) or None, default=None

Covariance matrix of X. If None, Lasso is used for X distillation.

params_lasso_screening : dict

Parameters for the variable screening Lasso:

- alpha : float or None - L1 regularization strength. If None, determined by CV.
- n_alphas : int - Number of alphas for cross-validation (default: 10).
- alphas : array-like or None - List of alpha values to try in CV (default: None).
- alpha_max_fraction : float - Scale factor for alpha_max (default: 0.5).
- cv : int - Cross-validation folds (default: 5).
- tol : float - Convergence tolerance (default: 1e-6).
- max_iter : int - Maximum iterations (default: 1000).
- fit_intercept : bool - Whether to fit intercept (default: False).
- selection : {'cyclic'} - Feature selection method (default: 'cyclic').

params_lasso_distillation_x : dict or None, default=None

Parameters for X distillation Lasso. If None, uses params_lasso_screening.

refit : bool, default=False

Whether to refit the model on selected features after screening.

screening : bool, default=True

Whether to perform variable screening step based on Lasso coefficients.

screening_threshold : float, default=10

Percentile threshold for screening, between 0 and 100. Larger values keep more variables at the screening step; screening_threshold=100 keeps all variables.

centered : bool, default=True

Whether to center and scale features using StandardScaler.

n_jobs : int, default=1

Number of parallel jobs.

joblib_verbose : int, default=0

Verbosity level for parallel jobs.

fit_y : bool, default=False

Whether to fit y using selected features instead of using estimated_coef.

scaled_statistics : bool, default=False

Whether to use scaled statistics when computing importance.

random_state : int, default=2022

Random seed for reproducibility.
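The alpha_max_fraction entry of params_lasso_screening can be illustrated with a small sketch. For a Lasso without intercept, the smallest penalty that drives every coefficient to zero is max|Xᵀy| / n_samples. The sketch assumes that the screening penalty is obtained by multiplying this value by alpha_max_fraction; the helper name lasso_alpha_max is hypothetical and not part of hidimstat.

```python
import numpy as np


def lasso_alpha_max(X, y):
    """Smallest alpha for which a no-intercept Lasso has all-zero coefficients."""
    n_samples = X.shape[0]
    return np.max(np.abs(X.T @ y)) / n_samples


rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(100)

# Keys taken from the documented defaults; only a subset is shown here.
params_lasso_screening = {"alpha": None, "alpha_max_fraction": 0.5, "cv": 5}

# Assumption: the effective penalty is a fraction of alpha_max.
alpha_max = lasso_alpha_max(X, y)
alpha = params_lasso_screening["alpha_max_fraction"] * alpha_max
```

A fraction close to 1 gives a very sparse screening model, while a smaller fraction lets more coefficients become non-zero.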

Attributes:
coefficient_ : ndarray of shape (n_features,)

Estimated feature coefficients after screening/refitting during fit method.

clf_x_ : list of estimators of length n_features

Fitted models for X distillation (Lasso or None if using sigma_X).

clf_y_ : list of estimators of length n_features

Fitted models for y distillation (sklearn estimator or None if using estimated_coef and Lasso estimator).

clf_screening_ : LassoCV or Lasso

Fitted screening model if estimated_coef is None.

non_selection_ : ndarray

Indices of features not selected after screening.

pvalues_ : ndarray of shape (n_features,)

Computed p-values for each feature.

importances_ : ndarray of shape (n_features,)

Importance scores for each feature. These are test statistics that follow a standard normal distribution under the null hypothesis.

Notes

The implementation follows Liu et al. (2022), introducing distillation to speed up conditional randomization testing. Key steps:

1. Optional screening using Lasso coefficients to reduce dimensionality.
2. Distillation to estimate conditional distributions.
3. Test statistic computation using residual correlations.
4. P-value calculation assuming Gaussian null distribution.
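The last step can be sketched with the standard library alone. Under a standard normal null, a two-sided p-value for a statistic z is 2 * (1 - Φ(|z|)), which equals erfc(|z| / √2). The function name below is hypothetical and only illustrates the conversion; it is not the library's own code.

```python
import math


def two_sided_gaussian_pvalue(statistic):
    """Two-sided p-value of a statistic under a standard normal null."""
    # 2 * (1 - Phi(|z|)) is identical to erfc(|z| / sqrt(2)).
    return math.erfc(abs(statistic) / math.sqrt(2.0))


# Illustrative test statistics, one per feature.
statistics = [0.0, 1.96, -3.2]
pvalues = [two_sided_gaussian_pvalue(z) for z in statistics]
```

A statistic of zero gives a p-value of 1, and the sign of the statistic does not affect the result.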

References

fit(X, y)

Fit the dCRT model.

This method fits the distilled conditional randomization test (dCRT) model as described in Liu et al. [2]. It performs optional feature screening using Lasso, computes coefficients, and prepares the model for importance and p-value computation.

Parameters:
X : array-like of shape (n_samples, n_features)

Training data matrix.

y : array-like of shape (n_samples,)

Target values.

Returns:
self : object

Returns the fitted instance.

Notes

Main steps:

1. Optional data centering with StandardScaler
2. Lasso screening of variables (if no estimated coefficients provided)
3. Feature selection based on coefficient magnitudes
4. Model refitting on selected features (if refit=True)
5. Fit model for future distillation

The screening threshold controls which features are kept based on their Lasso coefficients. Features with coefficients below the threshold are set to zero.
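A minimal sketch of percentile-based screening follows. It assumes that the cutoff is the (100 - screening_threshold)-th percentile of the absolute coefficients and that coefficients at or above the cutoff are kept; the exact comparison used by the library may differ. The function name screen_coefficients is hypothetical.

```python
import numpy as np


def screen_coefficients(coef, screening_threshold=10):
    """Zero out coefficients whose magnitude falls below a percentile cutoff."""
    abs_coef = np.abs(coef)
    # Assumption: larger screening_threshold lowers the cutoff and keeps more features.
    cutoff = np.percentile(abs_coef, 100 - screening_threshold)
    keep = abs_coef >= cutoff
    screened = np.where(keep, coef, 0.0)
    non_selection = np.flatnonzero(~keep)
    return screened, non_selection


coef = np.array([0.0, 0.05, -0.1, 0.0, 2.0, 0.0, -1.5, 0.0, 0.02, 0.0])
screened, non_selection = screen_coefficients(coef, screening_threshold=20)
screened_all, non_selection_all = screen_coefficients(coef, screening_threshold=100)
```

With screening_threshold=100 the cutoff is the smallest absolute coefficient, so every feature is kept, which matches the documented behavior of that setting.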

References

importance(X, y)

Compute feature importance scores using distilled CRT.

Calculates test statistics and p-values for each feature using residual correlations after the distillation process.

Parameters:
X : array-like of shape (n_samples, n_features)

Input data matrix.

y : array-like of shape (n_samples,)

Target values.

Attributes:
importances_ : same as return value

pvalues_ : ndarray of shape (n_features,)

Two-sided p-values for each feature under Gaussian null.

Returns:
importances_ : ndarray of shape (n_features,)

Test statistics/importance scores for each feature. For unselected features, the score is set to 0.

Notes

For each selected feature j:

1. Computes residuals from regressing X_j on other features
2. Computes residuals from regressing y on other features
3. Calculates test statistic from correlation of residuals
4. Computes p-value assuming standard normal distribution
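The first three steps can be sketched for a single feature as follows. This is an illustration under stated assumptions, not the library's implementation: ordinary least squares stands in for the Lasso and the user-supplied estimator, no intercept is fitted, and the statistic is taken to be √n times the correlation of the two residual vectors. The function name residual_statistic is hypothetical.

```python
import math

import numpy as np


def residual_statistic(X, y, j):
    """Residual-correlation statistic for feature j (least-squares stand-in)."""
    n_samples = X.shape[0]
    others = np.delete(X, j, axis=1)
    # Step 1: residuals of X_j regressed on the other features.
    beta_x, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    x_res = X[:, j] - others @ beta_x
    # Step 2: residuals of y regressed on the other features.
    beta_y, *_ = np.linalg.lstsq(others, y, rcond=None)
    y_res = y - others @ beta_y
    # Step 3: scaled correlation of the two residual vectors.
    corr = (x_res @ y_res) / math.sqrt((x_res @ x_res) * (y_res @ y_res))
    return math.sqrt(n_samples) * corr


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = 2.0 * X[:, 0] + rng.standard_normal(200)  # only feature 0 carries signal

stat_signal = residual_statistic(X, y, 0)
stat_null = residual_statistic(X, y, 3)
```

For the feature that drives y the statistic is large in magnitude, while for an irrelevant feature it behaves like a draw from a standard normal distribution.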

fit_importance(X, y, cv=None)

Fits the model to the data and computes feature importance.

A convenience method that combines fit() and importance() into a single call. First fits the dCRT model to the data, then calculates importance scores.

Parameters:
X : array-like of shape (n_samples, n_features)

Training data matrix.

y : array-like of shape (n_samples,)

Target values.

cv : None or int, optional (default=None)

Not used. Included for compatibility. A warning will be issued if provided.

Returns:
importance : ndarray of shape (n_features,)

Feature importance scores/test statistics. For features not selected during screening, scores are set to 0.

Notes

Also sets the importances_ and pvalues_ attributes on the instance. See fit() and importance() for details on the underlying computations.

get_metadata_routing()

Get metadata routing of this object.

Please check the scikit-learn User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

selection(k_best=None, percentile=None, threshold=None, threshold_pvalue=None)

Selects features based on variable importance. If several arguments are not None, the returned selection is the conjunction (logical AND) of the corresponding criteria.

Parameters:
k_best : int, optional, default=None

Selects the top k features based on importance scores.

percentile : float, optional, default=None

Selects features based on a specified percentile of importance scores.

threshold : float, optional, default=None

Selects features with importance scores above the specified threshold.

threshold_pvalue : float, optional, default=None

Selects features with p-values below the specified threshold.

Returns:
selection : array-like of shape (n_features,)

Binary array indicating the selected features.
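How several criteria combine into one mask can be sketched as below. This is an illustration rather than the library's code: the function name select_features is hypothetical, k_best is assumed to rank features by raw importance score, and the percentile criterion is left out because its exact semantics are not described here.

```python
import numpy as np


def select_features(importances, pvalues, k_best=None, threshold=None,
                    threshold_pvalue=None):
    """Combine the active selection criteria with a logical AND."""
    mask = np.ones(importances.shape[0], dtype=bool)
    if k_best is not None:
        # Assumption: "top k" means the k largest raw importance scores.
        top = np.argsort(importances)[::-1][:k_best]
        in_top = np.zeros_like(mask)
        in_top[top] = True
        mask &= in_top
    if threshold is not None:
        mask &= importances > threshold
    if threshold_pvalue is not None:
        mask &= pvalues < threshold_pvalue
    return mask


importances = np.array([3.1, 0.2, 2.5, 0.0, 1.9])
pvalues = np.array([0.002, 0.84, 0.012, 1.0, 0.057])

# Feature 4 is in the top 3 by score but fails the p-value criterion.
mask = select_features(importances, pvalues, k_best=3, threshold_pvalue=0.05)
```

Because the criteria are intersected, adding a criterion can only shrink the selected set, never enlarge it.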

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.