.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "generated/gallery/examples/plot_2D_simulation_example.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_generated_gallery_examples_plot_2D_simulation_example.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_generated_gallery_examples_plot_2D_simulation_example.py:


Ensemble Clustered Inference on 2D Data
=======================================

In this example, we present how to perform inference on simulated 2D data. This setting
is particularly challenging since the number of features (pixels in 2D) is much larger
than the number of samples. We first illustrate the limitations of the Desparsified
Lasso method in this setting and then present two methods that leverage the data's
spatial structure to build clusters and perform the inference at the cluster level.
We first show how to use clustered inference with DL (:class:`hidimstat.CluDL`).
We then show how to use clustered inference with Ensembled CluDL
(:class:`hidimstat.EnCluDL`).

.. GENERATED FROM PYTHON SOURCE LINES 16-23

Generating the data
-------------------
We begin by simulating 2D data where the support (i.e., the predictive features)
consists of four regions of neighboring pixels located in each corner of the 2D image.
The target variable :math:`y` is a continuous variable generated from a linear model. To
make the problem more challenging, the pixels are spatially correlated using a
Gaussian filter.

.. GENERATED FROM PYTHON SOURCE LINES 23-64

.. code-block:: Python


    import matplotlib.pyplot as plt
    import seaborn as sns
    from matplotlib.colors import ListedColormap
    from matplotlib.patches import Patch

    from hidimstat._utils.scenario import multivariate_simulation_spatial

    n_samples = 100
    shape = (40, 40)
    n_features = shape[1] * shape[0]
    roi_size = 5  # size of the edge of the four predictive regions
    signal_noise_ratio = 16.0  # noise standard deviation
    smooth_X = 1.0  # level of spatial smoothing introduced by the Gaussian filter

    # generating the data
    X_init, y, beta, epsilon = multivariate_simulation_spatial(
        n_samples, shape, roi_size, signal_noise_ratio, smooth_X, seed=0
    )
    print(f"Number of samples: {X_init.shape[0]}, Number of features: {n_features}")


    # visualize the data
    cmap = ListedColormap(["white", "tab:green"])  # 0 -> blue (null), 1 -> green (support)
    fig, ax = plt.subplots(figsize=(4, 4), subplot_kw={"xticks": [], "yticks": []})
    ax.imshow(
        beta.reshape(shape),
        cmap=cmap,
        vmin=0,
        vmax=1,
    )

    # Legend: green = support, blue = null
    legend_handles = [
        Patch(facecolor="tab:green", edgecolor="k", label="Support"),
        Patch(facecolor="white", edgecolor="k", label="Null"),
    ]
    ax.legend(handles=legend_handles, loc="lower center")
    plt.tight_layout()


.. image-sg:: /generated/gallery/examples/images/sphx_glr_plot_2D_simulation_example_001.png
   :alt: plot 2D simulation example
   :srcset: /generated/gallery/examples/images/sphx_glr_plot_2D_simulation_example_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Number of samples: 100, Number of features: 1600


.. GENERATED FROM PYTHON SOURCE LINES 65-74

Inference with Desparsified Lasso
---------------------------------
First, we perform inference using the Desparsified Lasso method, treating the data as
a standard high-dimensional regression problem without considering its spatial
structure. The aim of the inference step is to recover the support while controlling
the Family-Wise Error Rate (FWER) at a targeted level of 0.1. To achieve this, we use
the Bonferroni correction, applying a factor equal to the number of features. For more
details about the Desparsified Lasso method, see :footcite:t:`javanmard2014confidence`,
:footcite:t:`zhang2014confidence` and :footcite:t:`van2014asymptotically`.

.. GENERATED FROM PYTHON SOURCE LINES 74-100

.. code-block:: Python


    import numpy as np
    from sklearn.linear_model import LassoCV

    from hidimstat import DesparsifiedLasso

    fwer_target = 0.1
    n_jobs = 5  # number of workers for parallel computing

    # compute desparsified lasso
    estimator = LassoCV(max_iter=1000, tol=0.0001, eps=0.01, fit_intercept=False)
    dl = DesparsifiedLasso(estimator=estimator, n_jobs=n_jobs, random_state=0)
    dl.fit_importance(X_init, y)

    # compute estimated support (first method)
    selected_dl = dl.fwer_selection(fwer=fwer_target, n_tests=n_features)

    tp_mask = ((selected_dl.astype(int) == 1) & (beta == 1)).astype(bool)
    fp_mask = ((selected_dl.astype(int) == 1) & (beta == 0)).astype(bool)
    fn_mask = ((selected_dl.astype(int) == 0) & (beta == 1)).astype(bool)
    mask_dl = np.zeros(shape)
    mask_dl[tp_mask.reshape(shape)] = 1
    mask_dl[fp_mask.reshape(shape)] = -2
    mask_dl[fn_mask.reshape(shape)] = -1


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/.venv/lib/python3.13/site-packages/sklearn/linear_model/_coordinate_descent.py:695: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 8.696e-01, tolerance: 8.364e-01
      model = cd_fast.enet_coordinate_descent(


.. GENERATED FROM PYTHON SOURCE LINES 101-102

We visualize the support estimated by the Desparsified Lasso method.

.. GENERATED FROM PYTHON SOURCE LINES 102-115

.. code-block:: Python


    _, ax = plt.subplots(figsize=(4, 4), subplot_kw={"xticks": [], "yticks": []})
    cmap = ListedColormap(["tab:red", "tab:purple", "white", "tab:green"])
    ax.imshow(mask_dl, cmap=cmap, vmin=-2, vmax=1)
    legend_handles = [
        Patch(facecolor="tab:green", edgecolor="k", label="True Positive"),
        Patch(facecolor="tab:red", edgecolor="k", label="False Positive"),
        Patch(facecolor="tab:purple", edgecolor="k", label="False Negative"),
    ]
    ax.legend(handles=legend_handles, loc="lower center")
    plt.tight_layout()


.. image-sg:: /generated/gallery/examples/images/sphx_glr_plot_2D_simulation_example_002.png
   :alt: plot 2D simulation example
   :srcset: /generated/gallery/examples/images/sphx_glr_plot_2D_simulation_example_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 116-121

It can be seen that:
**1.** the Desparsified Lasso method is not powerful enough to recover the support,
it only selects scattered pixels as true positives without identifying the regions.
**2.** the number of false positives is quite high, and sometimes false positives are
located far from the true support.

.. GENERATED FROM PYTHON SOURCE LINES 124-134

Clustered inference with CluDL
-------------------------------
To improve the power of the inference, we can leverage the spatial structure of the
data. The idea is to group correlated pixels into clusters, using an additional
spatial constraint (pixels are iteratively merged with neighboring pixels). This
approach leads to a dimension reduction while preserving the data's spatial structure.
To control the FWER at the targeted level of 0.1, we perform Bonferroni correction,
but here the correction factor is equal to the number of clusters instead of the
number of features. For more details about CluDL, see
:footcite:t:`chevalier2022spatially`.

.. GENERATED FROM PYTHON SOURCE LINES 134-155

.. code-block:: Python


    from sklearn.base import clone
    from sklearn.cluster import FeatureAgglomeration
    from sklearn.feature_extraction import image

    from hidimstat import CluDL

    # Clustering step
    n_clusters = 200
    connectivity = image.grid_to_graph(n_x=shape[0], n_y=shape[1])
    clustering = FeatureAgglomeration(
        n_clusters=n_clusters, connectivity=connectivity, linkage="ward"
    )

    dl_2 = DesparsifiedLasso(estimator=clone(estimator), n_jobs=n_jobs)
    clu_dl = CluDL(desparsified_lasso=dl_2, clustering=clustering, random_state=0)
    clu_dl.fit_importance(X_init, y)

    selected_cdl = clu_dl.fwer_selection(fwer=fwer_target, n_tests=n_clusters)


.. GENERATED FROM PYTHON SOURCE LINES 156-157

Visualizing the support estimated by CluDL

.. GENERATED FROM PYTHON SOURCE LINES 157-178

.. code-block:: Python


    tp_mask = ((selected_cdl.astype(int) == 1) & (beta == 1)).astype(bool)
    fp_mask = ((selected_cdl.astype(int) == 1) & (beta == 0)).astype(bool)
    fn_mask = ((selected_cdl.astype(int) == 0) & (beta == 1)).astype(bool)
    mask_cludl = np.zeros(shape)
    mask_cludl[tp_mask.reshape(shape)] = 1
    mask_cludl[fp_mask.reshape(shape)] = -2
    mask_cludl[fn_mask.reshape(shape)] = -1

    _, ax = plt.subplots(figsize=(4, 4), subplot_kw={"xticks": [], "yticks": []})
    cmap = ListedColormap(["tab:red", "tab:purple", "white", "tab:green"])
    ax.imshow(mask_cludl, cmap=cmap, vmin=-2, vmax=1)
    legend_handles = [
        Patch(facecolor="tab:green", edgecolor="k", label="True Positive"),
        Patch(facecolor="tab:red", edgecolor="k", label="False Positive"),
        Patch(facecolor="tab:purple", edgecolor="k", label="False Negative"),
    ]
    ax.legend(handles=legend_handles, loc="lower center")
    plt.tight_layout()


.. image-sg:: /generated/gallery/examples/images/sphx_glr_plot_2D_simulation_example_003.png
   :alt: plot 2D simulation example
   :srcset: /generated/gallery/examples/images/sphx_glr_plot_2D_simulation_example_003.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 179-183

With ``CluDL``, the support recovery is more powerful, with large regions of the true
support being correctly identified. However, some false positives remain. In this
case, these false positives consist of small clusters that can be either contiguous to
the true support or located far from it.

.. GENERATED FROM PYTHON SOURCE LINES 186-197

Inference with Ensembled CluDL
------------------------------
Finally, we perform inference using an ensembled version of ``CluDL`` called ``EnCluDL``. The
idea is to run several ``CluDL`` algorithms with different clustering choices. By
repeating the clustering step on different bootstrap samples of the data, this
approach derandomizes the clustering choice and makes it more robust to small
variations in the data. It can be efficiently parallelized since the different ``CluDL``
runs are independent and thus embarrassingly parallel. Similar to ``CluDL``, we perform
Bonferroni correction with a factor equal to the number of clusters on the p-values
obtained by aggregating the different ``CluDL`` runs. For more details about ``EnCluDL``,
see :footcite:t:`chevalier2022spatially`.

.. GENERATED FROM PYTHON SOURCE LINES 197-213

.. code-block:: Python


    from hidimstat.ensemble_clustered_inference import EnCluDL

    enclu_dl = EnCluDL(
        desparsified_lasso=DesparsifiedLasso(estimator=clone(estimator), n_jobs=1),
        clustering=clustering,
        n_bootstraps=20,
        random_state=0,
        n_jobs=n_jobs,
        cluster_boostrap_size=0.5,
    )
    enclu_dl.fit_importance(X_init, y)

    selected_ecdl = enclu_dl.fwer_selection(fwer=fwer_target, n_tests=n_clusters)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Fitting clustered inferences:   0%|          | 0/20 [00:00<?, ?it/s]    Fitting clustered inferences:  50%|█████     | 10/20 [00:00<00:00, 20.30it/s]    Fitting clustered inferences:  75%|███████▌  | 15/20 [00:00<00:00, 14.18it/s]    Fitting clustered inferences: 100%|██████████| 20/20 [00:01<00:00, 11.42it/s]    Fitting clustered inferences: 100%|██████████| 20/20 [00:01<00:00, 12.70it/s]
    Computing importances:   0%|          | 0/20 [00:00<?, ?it/s]    Computing importances: 100%|██████████| 20/20 [00:00<00:00, 335.46it/s]


.. GENERATED FROM PYTHON SOURCE LINES 214-215

Visualizing the support estimated by EnCluDL

.. GENERATED FROM PYTHON SOURCE LINES 215-237

.. code-block:: Python


    tp_mask = ((selected_ecdl.astype(int) == 1) & (beta == 1)).astype(bool)
    fp_mask = ((selected_ecdl.astype(int) == 1) & (beta == 0)).astype(bool)
    fn_mask = ((selected_ecdl.astype(int) == 0) & (beta == 1)).astype(bool)
    mask_encludl = np.zeros(shape)
    mask_encludl[tp_mask.reshape(shape)] = 1
    mask_encludl[fp_mask.reshape(shape)] = -2
    mask_encludl[fn_mask.reshape(shape)] = -1


    _, ax = plt.subplots(figsize=(4, 4), subplot_kw={"xticks": [], "yticks": []})
    cmap = ListedColormap(["tab:red", "tab:purple", "white", "tab:green"])
    ax.imshow(mask_encludl, cmap=cmap, vmin=-2, vmax=1)
    legend_handles = [
        Patch(facecolor="tab:green", edgecolor="k", label="True Positive"),
        Patch(facecolor="tab:red", edgecolor="k", label="False Positive"),
        Patch(facecolor="tab:purple", edgecolor="k", label="False Negative"),
    ]
    ax.legend(handles=legend_handles, loc="lower center")
    plt.tight_layout()


.. image-sg:: /generated/gallery/examples/images/sphx_glr_plot_2D_simulation_example_004.png
   :alt: plot 2D simulation example
   :srcset: /generated/gallery/examples/images/sphx_glr_plot_2D_simulation_example_004.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 238-243

Compared to ``CluDL``, ``EnCluDL`` further improves the inference, primarily by reducing the
number of false positives located far from the true support. An intuitive explanation
for this improvement is that while false discoveries far from the true support may
occur randomly in a single clustering, they are less likely to occur repeatedly in
overlapping clusters obtained from different bootstrap samples of the data.

.. GENERATED FROM PYTHON SOURCE LINES 246-255

Support recovery with spatial tolerance
---------------------------------------
For spatially structured data, false discoveries that are close to the true support,
or even part of clusters that intersect the true support, are likely to change less
the interpretation of the results than false discoveries located far from the support.
Therefore, it can be relevant to consider a spatially relaxed support recovery,
The idea is to penalize less those false discoveries that are close
to the true support. To achieve this, we introduce an extended support that includes
the true support and a tolerance region around it.

.. GENERATED FROM PYTHON SOURCE LINES 255-278

.. code-block:: Python


    spatial_tolerance = 3
    roi_size_extended = roi_size + spatial_tolerance
    beta_extended = beta.copy().reshape(shape)
    beta_extended[0:roi_size_extended, 0:roi_size_extended] += 1
    beta_extended[-roi_size_extended:, -roi_size_extended:] += 1
    beta_extended[0:roi_size_extended, -roi_size_extended:] += 1
    beta_extended[-roi_size_extended:, 0:roi_size_extended] += 1

    # visualize the extended support
    cmap = ListedColormap(["white", "tab:orange", "tab:green"])
    fig, ax = plt.subplots(figsize=(4, 4), subplot_kw={"xticks": [], "yticks": []})
    ax.imshow(beta_extended.reshape(shape), cmap=cmap, vmin=0, vmax=2)
    # Legend: green = support, yellow = tolerance region, white = null
    legend_handles = [
        Patch(facecolor="tab:green", edgecolor="k", label="Support"),
        Patch(facecolor="tab:orange", edgecolor="k", label="Tolerance region"),
        Patch(facecolor="white", edgecolor="k", label="Null"),
    ]
    ax.legend(handles=legend_handles, loc="lower center")
    plt.tight_layout()


.. image-sg:: /generated/gallery/examples/images/sphx_glr_plot_2D_simulation_example_005.png
   :alt: plot 2D simulation example
   :srcset: /generated/gallery/examples/images/sphx_glr_plot_2D_simulation_example_005.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 279-283

Comparison with spatial tolerance
---------------------------------
Now, we compare the three methods presented above (Desparsified Lasso, CluDL and
EnCluDL) in terms of support recovery with spatial tolerance.

.. GENERATED FROM PYTHON SOURCE LINES 283-315

.. code-block:: Python


    def compute_spatially_relaxed_mask(mask, beta_extended):
        """Mark the false positives that are in the tolerance with a special value, -3"""
        support = beta_extended.reshape(shape) >= 1
        mask[support & (mask == -2)] = -3
        return mask


    mask_dl_relaxed = compute_spatially_relaxed_mask(mask_dl, beta_extended)
    mask_cludl_relaxed = compute_spatially_relaxed_mask(mask_cludl, beta_extended)
    mask_encldl_relaxed = compute_spatially_relaxed_mask(mask_encludl, beta_extended)
    cmap = ListedColormap(["tab:orange", "tab:red", "tab:purple", "white", "tab:green"])

    # sphinx_gallery_thumbnail_number = 6
    _, axes = plt.subplots(1, 3, figsize=(10, 4), subplot_kw={"xticks": [], "yticks": []})
    axes[0].imshow(mask_dl_relaxed, cmap=cmap, vmin=-3, vmax=1)
    axes[0].set_title("Desparsified Lasso", fontweight="bold")
    axes[1].imshow(mask_cludl_relaxed, cmap=cmap, vmin=-3, vmax=1)
    axes[1].set_title("CluDL", fontweight="bold")
    axes[2].imshow(mask_encldl_relaxed, cmap=cmap, vmin=-3, vmax=1)
    axes[2].set_title("EnCluDL", fontweight="bold")
    legend_handles = [
        Patch(facecolor="tab:green", edgecolor="k", label="True Positive"),
        Patch(facecolor="tab:orange", edgecolor="k", label="False Positive in tolerance"),
        Patch(facecolor="tab:red", edgecolor="k", label="False Positive out of tolerance"),
        Patch(facecolor="tab:purple", edgecolor="k", label="False Negative"),
    ]
    axes[0].legend(handles=legend_handles, loc="lower center")
    plt.tight_layout()


.. image-sg:: /generated/gallery/examples/images/sphx_glr_plot_2D_simulation_example_006.png
   :alt: Desparsified Lasso, CluDL, EnCluDL
   :srcset: /generated/gallery/examples/images/sphx_glr_plot_2D_simulation_example_006.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 316-332

Note: Choosing inference parameters
-----------------------------------
The choice of the number of clusters depends on several parameters, such as:
the structure of the data (a higher correlation between neighboring features
enable a greater dimension reduction, i.e. a smaller number of clusters),
the number of samples (small datasets require more dimension reduction) and
the required spatial tolerance (small clusters lead to limited spatial
uncertainty). Formally, "spatial tolerance" is defined by the largest
distance from the true support for which the occurrence of a false discovery
is not statistically controlled (c.f. :footcite:t:`chevalier2022spatially`).
Theoretically, the spatial tolerance ``delta`` is equal to the largest
cluster diameter. However this choice is conservative, notably in the case
of ensembled clustered inference. For these algorithms, we recommend to take
the average cluster radius. In this example, we choose ``n_clusters = 200``,
leading to a theoretical spatial tolerance ``delta = 6``, which is still
conservative (see Results).

.. GENERATED FROM PYTHON SOURCE LINES 335-338

References
----------
.. footbibliography::


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 17.707 seconds)

**Estimated memory usage:**  222 MB


.. _sphx_glr_download_generated_gallery_examples_plot_2D_simulation_example.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_2D_simulation_example.ipynb <plot_2D_simulation_example.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_2D_simulation_example.py <plot_2D_simulation_example.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_2D_simulation_example.zip <plot_2D_simulation_example.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_