hidimstat.D0CRT
- class hidimstat.D0CRT(estimator, method: str = 'predict', estimated_coef=None, sigma_X=None, params_lasso_screening={'alpha': None, 'alpha_max_fraction': 0.5, 'alphas': None, 'cv': 5, 'fit_intercept': False, 'max_iter': 1000, 'n_alphas': 10, 'selection': 'cyclic', 'tol': 1e-06}, params_lasso_distillation_x=None, refit=False, screening=True, screening_threshold=10, centered=True, n_jobs=1, joblib_verbose=0, fit_y=False, scaled_statistics=False, random_state=2022)
Bases: BaseVariableImportance
Implements distilled conditional randomization test (dCRT) without interactions.
This class provides a fast implementation of the Conditional Randomization Test (Candes et al. [1]) using the distillation process from Liu et al. [2]. The approach accelerates variable selection by combining Lasso-based screening with residual-based test statistics. The y-distillation uses the given estimator, and the x-distillation uses a Lasso estimator. Based on the original implementation at moleibobliu/Distillation-CRT.
- Parameters:
- estimator : sklearn estimator
The base estimator used for y-distillation and prediction (e.g., Lasso, RandomForest, …).
- method : str, default="predict"
Name of the estimator method used for predictions ("predict", "predict_proba", etc.).
- estimated_coef : array-like of shape (n_features,) or None, default=None
Pre-computed feature coefficients. If None, coefficients are estimated via Lasso.
- sigma_X : array-like of shape (n_features, n_features) or None, default=None
Covariance matrix of X. If None, Lasso is used for X distillation.
- params_lasso_screening : dict
Parameters for the variable-screening Lasso:
  - alpha : float or None - L1 regularization strength. If None, determined by CV.
  - n_alphas : int - Number of alphas for cross-validation (default: 10).
  - alphas : array-like or None - List of alpha values to try in CV (default: None).
  - alpha_max_fraction : float - Scale factor for alpha_max (default: 0.5).
  - cv : int - Cross-validation folds (default: 5).
  - tol : float - Convergence tolerance (default: 1e-6).
  - max_iter : int - Maximum iterations (default: 1000).
  - fit_intercept : bool - Whether to fit an intercept (default: False).
  - selection : {'cyclic'} - Coordinate update order passed to the Lasso solver (default: 'cyclic').
- params_lasso_distillation_x : dict or None, default=None
Parameters for the X-distillation Lasso. If None, params_lasso_screening is used.
- refit : bool, default=False
Whether to refit the model on selected features after screening.
- screening : bool, default=True
Whether to perform the variable screening step based on Lasso coefficients.
- screening_threshold : float, default=10
Percentile threshold for screening (0-100). Larger values keep more variables; screening_threshold=100 keeps all variables.
- centered : bool, default=True
Whether to center and scale features using StandardScaler.
- n_jobs : int, default=1
Number of parallel jobs.
- joblib_verbose : int, default=0
Verbosity level for parallel jobs.
- fit_y : bool, default=False
Whether to fit y using selected features instead of using estimated_coef.
- scaled_statistics : bool, default=False
Whether to use scaled statistics when computing importance.
- random_state : int, default=2022
Random seed for reproducibility.
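To make the role of alpha_max_fraction more concrete, the sketch below computes the standard Lasso alpha_max (the smallest penalty at which a no-intercept Lasso, under scikit-learn's objective scaling, has all-zero coefficients) and scales it. That the screening step uses exactly alpha_max_fraction * alpha_max as its penalty is an assumption made here for illustration, and lasso_alpha_from_fraction is a hypothetical helper, not part of hidimstat.

```python
import numpy as np


def lasso_alpha_from_fraction(X, y, alpha_max_fraction=0.5):
    """Hypothetical helper: scale the Lasso alpha_max by a fraction."""
    n_samples = X.shape[0]
    # Smallest alpha for which a no-intercept Lasso (sklearn scaling)
    # returns all-zero coefficients.
    alpha_max = np.max(np.abs(X.T @ y)) / n_samples
    # Assumption: the screening penalty is a fixed fraction of alpha_max.
    return alpha_max_fraction * alpha_max


rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X[:, 0] + 0.1 * rng.standard_normal(50)
alpha = lasso_alpha_from_fraction(X, y, alpha_max_fraction=0.5)
```

A smaller fraction gives a weaker penalty, so more coefficients are likely to stay non-zero going into screening.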
- Attributes:
- coefficient_ : ndarray of shape (n_features,)
Feature coefficients estimated during fit, after screening and optional refitting.
- clf_x_ : list of estimators of length n_features
Fitted models for X distillation (Lasso, or None if sigma_X is used).
- clf_y_ : list of estimators of length n_features
Fitted models for y distillation (sklearn estimator, or None if estimated_coef and a Lasso estimator are used).
- clf_screening_ : LassoCV or Lasso
Fitted screening model if estimated_coef is None.
- non_selection_ : ndarray
Indices of features not selected after screening.
- pvalues_ : ndarray of shape (n_features,)
Computed p-values for each feature.
- importances_ : ndarray of shape (n_features,)
Importance scores for each feature: test statistics that follow a standard normal distribution under the null.
Notes
The implementation follows Liu et al. (2022), introducing distillation to speed up conditional randomization testing. Key steps:
1. Optional screening using Lasso coefficients to reduce dimensionality.
2. Distillation to estimate conditional distributions.
3. Test statistic computation using residual correlations.
4. P-value calculation assuming Gaussian null distribution.
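As an illustration of the last step, a two-sided p-value under a standard normal null can be computed from a test statistic with the standard library alone. This is a minimal sketch of the textbook formula, not the library's code; two_sided_pvalue is a hypothetical name.

```python
import math


def two_sided_pvalue(z):
    """Two-sided p-value of a statistic z under a standard normal null."""
    # P(|Z| >= |z|) = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2.0))


# A statistic of 0 gives p = 1; larger magnitudes give smaller p-values.
pvalues = [two_sided_pvalue(z) for z in (0.0, 1.96, -3.0)]
```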
References
- fit(X, y)
Fit the dCRT model.
This method fits the distilled Conditional Randomization Test (dCRT) model as described in Liu et al. [2]. It optionally screens features using Lasso, computes coefficients, and prepares the model for importance and p-value computation.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Training data matrix.
- y : array-like of shape (n_samples,)
Target values.
- Returns:
- self : object
Returns the fitted instance.
Notes
Main steps:
1. Optional data centering with StandardScaler.
2. Lasso screening of variables (if no estimated coefficients are provided).
3. Feature selection based on coefficient magnitudes.
4. Model refitting on selected features (if refit=True).
5. Fitting of the models used later for distillation.
The screening threshold controls which features are kept based on their Lasso coefficients. Features whose coefficient magnitudes fall below the threshold have their coefficients set to zero.
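The screening rule can be sketched as follows. The exact cutoff used by the library is not stated above; this sketch assumes the cutoff is the (100 - screening_threshold)-th percentile of the absolute coefficients, which is consistent with screening_threshold=100 keeping every variable. The name screen_coefficients is hypothetical.

```python
import numpy as np


def screen_coefficients(coef, screening_threshold=10):
    """Zero out coefficients whose magnitude falls below a percentile cutoff."""
    abs_coef = np.abs(coef)
    # Assumed rule: cutoff at the (100 - screening_threshold)-th percentile.
    cutoff = np.percentile(abs_coef, 100 - screening_threshold)
    non_selection = np.where(abs_coef < cutoff)[0]
    screened = np.array(coef, dtype=float)  # work on a copy
    screened[non_selection] = 0.0
    return screened, non_selection


coef = np.array([0.0, 0.1, -2.0, 0.5, 0.0, 3.0, -0.2, 0.0, 0.05, 1.0])
# Keep roughly the top 30% of coefficients by magnitude.
screened, non_selection = screen_coefficients(coef, screening_threshold=30)
```

Under this assumed rule, the three largest-magnitude coefficients out of ten survive with screening_threshold=30, and nothing is dropped with screening_threshold=100.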
References
- importance(X, y)
Compute feature importance scores using distilled CRT.
Calculates test statistics and p-values for each feature using residual correlations after the distillation process.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Input data matrix.
- y : array-like of shape (n_samples,)
Target values.
- Attributes:
- importances_ : same as return value
- pvalues_ : ndarray of shape (n_features,)
Two-sided p-values for each feature under Gaussian null.
- Returns:
- importances_ : ndarray of shape (n_features,)
Test statistics/importance scores for each feature. For unselected features, the score is set to 0.
Notes
For each selected feature j:
1. Computes residuals from regressing X_j on other features.
2. Computes residuals from regressing y on other features.
3. Calculates test statistic from correlation of residuals.
4. Computes p-value assuming standard normal distribution.
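The per-feature steps above can be sketched with NumPy. This is an illustration under stated assumptions rather than the library's implementation: ordinary least squares stands in for both the Lasso x-distillation and the estimator-based y-distillation, the statistic is taken to be sqrt(n) times the empirical correlation of the two residual vectors, and residual_statistic is a hypothetical name.

```python
import numpy as np


def residual_statistic(X, y, j):
    """Correlation-type statistic for feature j from two sets of residuals."""
    n_samples = X.shape[0]
    X_minus_j = np.delete(X, j, axis=1)
    # Step 1 (OLS stand-in for the Lasso): residual of X_j given the rest.
    beta_x, *_ = np.linalg.lstsq(X_minus_j, X[:, j], rcond=None)
    x_res = X[:, j] - X_minus_j @ beta_x
    # Step 2 (OLS stand-in for the estimator): residual of y given the rest.
    beta_y, *_ = np.linalg.lstsq(X_minus_j, y, rcond=None)
    y_res = y - X_minus_j @ beta_y
    # Step 3: sqrt(n) times the (uncentered) correlation of the residuals.
    return np.sqrt(n_samples) * (x_res @ y_res) / (
        np.linalg.norm(x_res) * np.linalg.norm(y_res)
    )


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
y = 2.0 * X[:, 0] + rng.standard_normal(200)  # only feature 0 drives y
stats = np.array([residual_statistic(X, y, j) for j in range(X.shape[1])])
```

In this simulated setting the statistic for feature 0 is far from zero, while the others behave roughly like standard normal draws, which is what step 4 relies on when converting statistics to p-values.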
- fit_importance(X, y, cv=None)
Fits the model to the data and computes feature importance.
A convenience method that combines fit() and importance() into a single call. First fits the dCRT model to the data, then calculates importance scores.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Training data matrix.
- y : array-like of shape (n_samples,)
Target values.
- cv : None or int, default=None
Not used. Included for compatibility. A warning will be issued if provided.
- Returns:
- importance : ndarray of shape (n_features,)
Feature importance scores/test statistics. For features not selected during screening, scores are set to 0.
Notes
Also sets the importances_ and pvalues_ attributes on the instance. See fit() and importance() for details on the underlying computations.
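The calling pattern described here (fit, then importance, with an ignored cv argument that triggers a warning) can be sketched with a toy class. TinyImportance is entirely hypothetical and its scores are placeholders; it only illustrates the control flow, not the dCRT computation or the exact warning the library emits.

```python
import warnings


class TinyImportance:
    """Hypothetical stand-in for the fit / importance / fit_importance pattern."""

    def fit(self, X, y):
        self.n_features_ = len(X[0])
        return self

    def importance(self, X, y):
        # Placeholder scores; a real implementation computes test statistics.
        self.importances_ = [0.0] * self.n_features_
        return self.importances_

    def fit_importance(self, X, y, cv=None):
        if cv is not None:
            # cv is accepted only for compatibility and is otherwise ignored.
            warnings.warn("cv is not used by this method.")
        self.fit(X, y)
        return self.importance(X, y)


scores = TinyImportance().fit_importance([[1.0, 2.0], [3.0, 4.0]], [0.0, 1.0])
```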
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- selection(k_best=None, percentile=None, threshold=None, threshold_pvalue=None)
Selects features based on variable importance. If several arguments are not None, the returned selection is the conjunction (logical AND) of all of them.
- Parameters:
- k_best : int, optional, default=None
Selects the top k features based on importance scores.
- percentile : float, optional, default=None
Selects features based on a specified percentile of importance scores.
- threshold : float, optional, default=None
Selects features with importance scores above the specified threshold.
- threshold_pvalue : float, optional, default=None
Selects features with p-values below the specified threshold.
- Returns:
- selection : array-like of shape (n_features,)
Binary array indicating the selected features.
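The conjunction of criteria can be sketched as a boolean mask that is ANDed with each active criterion. This is a hypothetical illustration, not the library's code: select_features is an invented name, tie-breaking for k_best is left to the sort order, and the percentile criterion is omitted because its exact definition is not given above.

```python
import numpy as np


def select_features(importances, pvalues, k_best=None, threshold=None,
                    threshold_pvalue=None):
    """Hypothetical sketch: AND together every criterion that is not None."""
    importances = np.asarray(importances, dtype=float)
    pvalues = np.asarray(pvalues, dtype=float)
    mask = np.ones(importances.shape[0], dtype=bool)
    if k_best is not None:
        # Indices of the k largest importance scores.
        top = np.argsort(importances)[::-1][:k_best]
        in_top = np.zeros_like(mask)
        in_top[top] = True
        mask &= in_top
    if threshold is not None:
        mask &= importances > threshold
    if threshold_pvalue is not None:
        mask &= pvalues < threshold_pvalue
    return mask


importances = [3.1, 0.0, 2.2, 0.4]
pvalues = [0.002, 1.0, 0.03, 0.69]
# Feature 3 is in the top 3 by score but fails the p-value criterion.
mask = select_features(importances, pvalues, k_best=3, threshold_pvalue=0.05)
```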
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
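The <component>__<parameter> naming convention can be illustrated by splitting keys on the first double underscore. The helper below is hypothetical and only shows how such keys decompose; the key names used in the example (such as estimator__alpha) are assumptions that depend on which estimator is wrapped.

```python
def split_nested_params(params):
    """Hypothetical helper: group 'component__parameter' keys by component."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first "__" only, so deeper nesting stays in sub_key.
        name, delim, sub_key = key.partition("__")
        if delim:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[name] = value
    return own, nested


own, nested = split_nested_params(
    {"screening": False, "estimator__alpha": 0.1, "estimator__max_iter": 500}
)
```

Here own holds the parameters set directly on the object, and nested holds those that would be forwarded to the named component.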