ModelXKnockoff#

class hidimstat.ModelXKnockoff(estimator=LassoCV(cv=KFold(n_splits=5, random_state=0, shuffle=True), max_iter=200000, n_jobs=1, random_state=1, tol=1e-06, verbose=0), ko_generator=GaussianKnockoffs(), n_repeats=1, centered=True, preconfigure_lasso_path=True, random_state=None, joblib_verbose=0, memory=None, n_jobs=1)[source]#

Bases: BaseVariableImportance

Model-X Knockoff

This class implements the Model-X knockoff inference procedure, an approach to control the False Discovery Rate (FDR) based on Candès et al. [1]. The original implementation can be found at msesia/knockoff-filter. The knockoff variables are generated as second-order Gaussian knockoffs using the equi-correlated method.

In addition, the class can generate multiple sets of Gaussian knockoff variables and calculate the test statistics for each set. It then aggregates the test statistics across the sets to improve stability and power.
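As a toy illustration (not the library's internal code), the feature-wise aggregation of per-repeat knockoff statistics by averaging can be pictured as:

```python
import numpy as np

# Toy sketch (not hidimstat's internals): test statistics from several
# knockoff draws are averaged feature-wise to stabilize them.
W_repeats = np.array([
    [0.8, -0.1, 0.3],   # statistics from knockoff draw 1
    [0.6,  0.1, 0.5],   # statistics from knockoff draw 2
])                      # shape (n_repeats, n_features)
W_aggregated = W_repeats.mean(axis=0)   # one statistic per feature
```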

Parameters:
estimator : estimator, default=LassoCV(…)

Estimator used to compute knockoff statistics. Must expose coefficients via coef_ (or best_estimator_.coef_ for CV wrappers) after fit.

ko_generator : object

Knockoff generator implementing fit(X) and sample(n_repeats, random_state).

n_repeats : int, default=1

Number of knockoff draws to average over.

centered : bool, default=True

If True, standardize X before fitting the generator and computing statistics.

preconfigure_lasso_path : bool, default=True

If True, configure the LassoCV estimator's regularization path before fitting. The maximum alpha is computed as alpha_max = max(X_ko.T @ y) / (2 * n_features), and an alpha grid of length n_alphas is created between alpha_max * exp(-n_alphas) and alpha_max.
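The grid construction above can be sketched in NumPy as follows (an illustration of the stated formula, not the library's code; X_ko denotes the concatenated [X, X_tilde] design):

```python
import numpy as np

# Illustrative sketch of the regularization path described above,
# assuming X_ko is the concatenated [X, X_tilde] design and y the target.
rng = np.random.default_rng(0)
n_samples, n_features = 50, 10
X_ko = rng.standard_normal((n_samples, 2 * n_features))
y = rng.standard_normal(n_samples)

alpha_max = np.max(X_ko.T @ y) / (2 * n_features)
n_alphas = 20
# Geometric grid from alpha_max * exp(-n_alphas) up to alpha_max.
alphas = alpha_max * np.exp(np.linspace(-n_alphas, 0, n_alphas))
```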

random_state : int or None, default=None

Random seed forwarded to the knockoff generator sampling.

joblib_verbose : int, default=0

Verbosity level for parallel jobs.

memory : str, joblib.Memory or None, default=None

Caching backend for expensive operations.

n_jobs : int, default=1

Number of parallel jobs (automatically capped to n_repeats).

Attributes:
importances_ : ndarray, shape (n_repeats, n_features)

Test statistics for each repeat.

pvalues_ : ndarray, shape (n_repeats, n_features)

Empirical p-values for each repeat.

threshold_fdr_ : float

Threshold computed by the FDR selection procedure.

aggregated_pval_ : ndarray or None

Aggregated p-values (when using p-value aggregation).

aggregated_eval_ : ndarray or None

Aggregated e-values (when using e-value aggregation).

estimators_ : list of estimators

List of estimators fitted on the concatenated design matrices, one per repeat.

n_features_ : int

Number of features on which the model was fitted.

Notes

Use the model_x_knockoff function for a functional interface that wraps this class. The class handles generator fitting, repeated knockoff sampling, statistic computation, and FDR-based selection.

__init__(estimator=LassoCV(cv=KFold(n_splits=5, random_state=0, shuffle=True), max_iter=200000, n_jobs=1, random_state=1, tol=1e-06, verbose=0), ko_generator=GaussianKnockoffs(), n_repeats=1, centered=True, preconfigure_lasso_path=True, random_state=None, joblib_verbose=0, memory=None, n_jobs=1)[source]#
fit(X, y)[source]#

Fit the knockoff generator and estimators to the data.

Parameters:
X : array-like of shape (n_samples, n_features)

Training data matrix, where n_samples is the number of samples and n_features is the number of features.

y : array-like of shape (n_samples,)

Target values.

Returns:
self : object

Returns the instance itself.

importance(X=None, y=None)[source]#

Calculate feature importance scores using Model-X knockoffs.

This method generates knockoff variables and computes test statistics to measure feature importance. For multiple repeats, the scores are averaged across repeats to improve stability.

Parameters:
X : array-like of shape (n_samples, n_features)

Training data matrix, where n_samples is the number of samples and n_features is the number of features.

y : array-like of shape (n_samples,)

Target values.

Returns:
importances_ : ndarray of shape (n_features,)

Feature importance scores for each feature. Higher absolute values indicate higher importance.

Notes

The method generates knockoff variables that satisfy the exchangeability property and computes test statistics comparing original features against their knockoffs. When n_repeats > 1, multiple sets of knockoffs are generated and results are averaged.

fit_importance(X, y)[source]#

Fits the model to the data and computes feature importance.

Parameters:
X : array-like of shape (n_samples, n_features)

The input data matrix, where n_samples is the number of samples and n_features is the number of features.

y : array-like of shape (n_samples,)

The target values.

cv : None or cross-validation generator, default=None

Cross-validation parameter. Not used in this method. A warning will be issued if provided.

Returns:
importances_ : ndarray of shape (n_features,)

Feature importance scores (p-values) for each feature. Lower values indicate higher importance. Values range from 0 to 1.

See also

fit

Method for fitting the generator only

importance

Method for computing importance scores only

Notes

This method combines the fit and importance computation steps. It first fits the generator to X and then computes importance scores by comparing the test statistics of the original features against those of their knockoff counterparts.

fdr_selection(fdr, fdr_control='bhq', evalues=False, reshaping_function=None, adaptive_aggregation=False, gamma=0.5)[source]#

Performs feature selection based on False Discovery Rate (FDR) control.

This method selects features by controlling the FDR using either p-values or e-values derived from test scores. It supports different FDR control methods and optional adaptive aggregation of the statistical values.

Parameters:
fdr : float

The target false discovery rate level (between 0 and 1).

fdr_control : str, default="bhq"

The FDR control method to use. Options are:
- "bhq": Benjamini-Hochberg procedure
- "bhy": Benjamini-Hochberg-Yekutieli procedure
- "ebh": e-BH procedure (only for e-values)

evalues : bool, default=False

If True, uses e-values for selection. If False, uses p-values.

reshaping_function : callable, default=None

Reshaping function for the BHY method; by default uses the sum of reciprocals.

adaptive_aggregation : bool, default=False

If True, uses adaptive weights for p-value aggregation. Only applicable when evalues=False.

gamma : float, default=0.5

The gamma parameter for quantile aggregation of p-values. Only used when evalues=False.

Returns:
numpy.ndarray

Boolean array indicating selected features (True for selected, False for not selected).

Raises:
AssertionError

If importances_ is None or if incompatible combinations of parameters are provided
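For intuition, a minimal sketch of the Benjamini-Hochberg ("bhq") step on a vector of p-values (an illustration, not hidimstat's implementation) could look like:

```python
import numpy as np

# Minimal Benjamini-Hochberg sketch: reject the k smallest p-values,
# where k is the largest rank with p_(k) <= fdr * k / n.
def bhq_selection(pvalues, fdr):
    n = len(pvalues)
    order = np.argsort(pvalues)
    thresholds = fdr * np.arange(1, n + 1) / n
    below = pvalues[order] <= thresholds
    selected = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        selected[order[:k + 1]] = True
    return selected

sel = bhq_selection(np.array([0.001, 0.04, 0.2, 0.5]), fdr=0.05)
```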

static lasso_coefficient_difference_statistic(estimators, n_features)[source]#

Compute the Lasso Coefficient-Difference (LCD) statistic from fitted estimators. Given a list of estimators fitted on the concatenated design matrix [X, X_tilde], this function computes the knockoff statistic for each original feature across repeats:

\[W_j = |\beta_j| - |\beta_j'|\]

where \(\beta_j\) and \(\beta_j'\) are the fitted coefficients for the original feature j and its knockoff counterpart j’.

Parameters:
estimators : list of estimators

List of fitted estimators on the concatenated design matrix [X, X_tilde]. Each estimator must expose coefficients via coef_ or best_estimator_.coef_.

n_features : int

Number of original features (not including knockoffs).

Returns:
test_statistic : ndarray, shape (n_repeats, n_features)

Knockoff statistics \(W_j\) for each original feature across repeats. The number of repeats corresponds to the length of the estimators list.
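Under the [X, X_tilde] coefficient layout described above, the statistic can be sketched directly from stacked coefficient vectors (an illustration, not the library's code):

```python
import numpy as np

# Sketch of W_j = |beta_j| - |beta_j'| for coefficients fitted on
# [X, X_tilde]: the first n_features entries are the original features,
# the remaining n_features entries their knockoff counterparts.
def lcd_statistic(coefs, n_features):
    coefs = np.atleast_2d(coefs)               # (n_repeats, 2 * n_features)
    return np.abs(coefs[:, :n_features]) - np.abs(coefs[:, n_features:])

coefs = np.array([[1.0, -0.5, 0.1, 0.2, 0.3, -0.05]])  # one repeat, 3 features
W = lcd_statistic(coefs, n_features=3)
```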

static knockoff_threshold(test_score, fdr=0.1)[source]#

Calculate the knockoff threshold following the procedure of Candès et al. [1].

Original code: msesia/knockoff-filter

Parameters:
test_score : 1D ndarray, shape (n_features,)

Vector of test statistics.

fdr : float

Desired controlled FDR (false discovery rate) level.

Returns:
threshold : float or np.inf

Threshold level.
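A hedged sketch of the knockoff+ threshold of Candès et al. [1], the smallest t among the |W_j| with (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= fdr (an illustration, not hidimstat's exact code):

```python
import numpy as np

# Knockoff+ threshold sketch: scan candidate thresholds |W_j| in
# increasing order and return the smallest one whose estimated FDP
# is at or below the target fdr; np.inf if none qualifies.
def knockoff_threshold(test_score, fdr=0.1):
    candidates = np.sort(np.abs(test_score[test_score != 0]))
    for t in candidates:
        fdp = (1.0 + np.sum(test_score <= -t)) / max(1, np.sum(test_score >= t))
        if fdp <= fdr:
            return t
    return np.inf

W = np.array([2.0, 1.5, 1.2, -0.3, 0.9, 0.8, -0.1, 1.1])
tau = knockoff_threshold(W, fdr=0.5)
selected = W >= tau   # features whose statistic clears the threshold
```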

fwer_selection(fwer, procedure='bonferroni', n_tests=None, two_tailed_test=False)[source]#

Performs feature selection based on Family-Wise Error Rate (FWER) control.

Parameters:
fwer : float

The target family-wise error rate level (between 0 and 1).

procedure : {'bonferroni'}, default='bonferroni'

The FWER control method to use:
- 'bonferroni': Bonferroni correction

n_tests : int or None, default=None

Correction factor for multiple testing. If None, uses the number of clusters if available, otherwise the number of features.

two_tailed_test : bool, default=False

If True, uses the sign of the importance scores to indicate whether the selected features have positive or negative effects.

Returns:
selected : ndarray of int

Integer array indicating the selected features: 1 for selected features with positive effects, -1 for selected features with negative effects, 0 for non-selected features.
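The Bonferroni step itself reduces to a single per-test threshold; a minimal sketch on a p-value vector (illustrative, not the class's exact code):

```python
import numpy as np

# Bonferroni sketch: control FWER at level `fwer` by testing each of
# n_tests p-values against the corrected threshold fwer / n_tests.
pvalues = np.array([0.001, 0.02, 0.4])
fwer = 0.05
n_tests = len(pvalues)
selected = pvalues <= fwer / n_tests   # per-test threshold ~0.0167
```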

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

importance_selection(k_best=None, percentile=None, threshold_max=None, threshold_min=None)[source]#

Selects features based on variable importance.

Parameters:
k_best : int, default=None

Selects the top k features based on importance scores.

percentile : float, default=None

Selects features based on a specified percentile of importance scores.

threshold_max : float, default=None

Selects features with importance scores below the specified maximum threshold.

threshold_min : float, default=None

Selects features with importance scores above the specified minimum threshold.

Returns:
selection : array-like of shape (n_features,)

Binary array indicating the selected features.

plot_importance(ax=None, ascending=False, feature_names=None, **seaborn_barplot_kwargs)[source]#

Plot feature importances as a horizontal bar plot.

Parameters:
ax : matplotlib.axes.Axes or None, default=None

Axes object to draw the plot onto; otherwise uses the current Axes.

ascending : bool, default=False

Whether to sort features by ascending importance.

feature_names : list of str or None, default=None

Names of the features to display on the plot.

**seaborn_barplot_kwargs : additional keyword arguments

Additional arguments passed to seaborn.barplot (https://seaborn.pydata.org/generated/seaborn.barplot.html).

Returns:
ax : matplotlib.axes.Axes

The Axes object with the plot.

pvalue_selection(k_lowest=None, percentile=None, threshold_max=0.05, threshold_min=None, alternative_hypothesis=False)[source]#

Selects features based on p-values.

Parameters:
k_lowest : int, default=None

Selects the k features with the lowest p-values.

percentile : float, default=None

Selects features based on a specified percentile of p-values.

threshold_max : float, default=0.05

Selects features with p-values below the specified maximum threshold (0 to 1).

threshold_min : float, default=None

Selects features with p-values above the specified minimum threshold (0 to 1).

alternative_hypothesis : bool, default=False

If True, selects based on 1 - pvalues instead of p-values.

Returns:
selection : array-like of shape (n_features,)

Binary array indicating the selected features (True for selected).
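For example, selecting the k_lowest features by p-value can be sketched as (illustration only, not the library's code):

```python
import numpy as np

# Sketch of k-lowest p-value selection: mark the k smallest p-values.
pvalues = np.array([0.30, 0.01, 0.20, 0.04])
k_lowest = 2
selection = np.zeros(pvalues.shape, dtype=bool)
selection[np.argsort(pvalues)[:k_lowest]] = True
```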

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

Examples using hidimstat.ModelXKnockoff#

Knockoff aggregation

Controlled multiple variable selection on the Wisconsin breast cancer dataset