
Bootstrap Estimate Evaluation
While point estimates provide a snapshot of model performance for each subgroup, they do not capture uncertainty or statistical variability. Bootstrap estimates enhance fairness auditing by enabling confidence interval computation, statistical significance testing, and disparity analysis through repeated resampling.
EquiBoots supports bootstrap-based estimation for both classification and regression tasks. This section walks through the process of generating bootstrapped group metrics, computing disparities, and performing statistical tests to assess whether observed differences are statistically significant.
1. Bootstrap Setup
Step 1.1: Instantiate EquiBoots with Bootstrapping
To begin, we instantiate the EquiBoots class with the required inputs: the true outcome labels (y_test), predicted class labels (y_pred), predicted probabilities (y_prob), and a DataFrame that holds sensitive attributes such as race or sex.
Note
y_test, y_pred, and y_prob are defined in the model generation section.
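If you are working through this page on its own, the sketch below shows one way those inputs might be produced. It uses synthetic data and a scikit-learn logistic regression purely for illustration; the feature matrix, outcome, and group labels are hypothetical stand-ins for your own modeling pipeline.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical features, binary outcome, and a sensitive attribute.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(2000, 5)), columns=[f"x{i}" for i in range(5)])
y = (rng.random(2000) < 0.3).astype(int)
race = rng.choice(["White", "Black", "Asian-Pac-Islander"], size=2000)

X_train, X_test, y_train, y_test, _, race_test = train_test_split(
    X, y, race, test_size=0.3, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Sensitive attributes aligned with the test set rows.
fairness_df = pd.DataFrame({"race": race_test})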
Note
For recommended behavior, use more than 5,000 bootstrap iterations; this ensures the bootstrap distribution adequately approximates the sampling distribution of each metric.
Bootstrapping is enabled by passing the required arguments during initialization. You must specify:
- A list of random seeds for reproducibility
- bootstrap_flag=True to enable resampling
- The number of bootstrap iterations (num_bootstraps)
- The sample size for each bootstrap (boot_sample_size)
- Optional settings for stratification and balancing
import numpy as np
import equiboots as eqb
int_list = np.linspace(0, 100, num=10, dtype=int).tolist()
eq2 = eqb.EquiBoots(
y_true=y_test,
y_pred=y_pred,
y_prob=y_prob,
fairness_df=fairness_df,
fairness_vars=["race"],
seeds=int_list,
reference_groups=["White"],
task="binary_classification",
bootstrap_flag=True,
num_bootstraps=5001,
boot_sample_size=1000,
group_min_size=150,
balanced=False,
stratify_by_outcome=False,
)
Step 1.2: Slice by Group and Compute Metrics
Once initialized, use the grouper() and slicer() methods to prepare bootstrapped samples for each subgroup:
eq2.grouper(groupings_vars=["race"])
boots_race_data = eq2.slicer("race")
Next, compute the metrics for each bootstrap sample. This returns a list of per-group metrics, one entry per bootstrap iteration, and may take some time to run.
race_metrics = eq2.get_metrics(boots_race_data)
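As a quick sanity check, you can inspect the returned list; the exact contents of each entry depend on your EquiBoots version, so treat the printout as illustrative:
print(len(race_metrics))  # one entry per bootstrap iteration
print(race_metrics[0])    # per-group metrics computed on the first bootstrap sample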
2. Disparity Analysis
Disparities quantify how model performance varies across subgroups relative to a reference group. Here we look at the disparity ratio.
Disparity Ratio: Metric ratio between a group and the reference. For example, a recall of 0.60 for a group against a reference recall of 0.75 gives a recall ratio of 0.60 / 0.75 = 0.80.
dispa = eq2.calculate_disparities(race_metrics, "race")
Plot Disparity Ratios
Use violin plots to visualize variability in disparity metrics across bootstrap iterations:
eqb.eq_group_metrics_plot(
group_metrics=dispa,
metric_cols=[
"Accuracy_Ratio", "Precision_Ratio", "Predicted_Prevalence_Ratio",
"Prevalence_Ratio", "FP_Rate_Ratio", "TN_Rate_Ratio", "Recall_Ratio",
],
name="race",
categories="all",
plot_type="violinplot",
color_by_group=True,
strict_layout=True,
figsize=(15, 8),
leg_cols=7,
max_cols=4,
)
Output

3. Metric Differences
EquiBoots also enables the user to examine disparities as metric differences, i.e., the difference between the model's performance for one group and for the reference group.
Disparity Difference: Metric difference between a group and the reference
diffs = eq2.calculate_differences(race_metrics, "race")
eqb.eq_group_metrics_plot(
group_metrics=diffs,
metric_cols=[
"Accuracy_diff", "Precision_diff", "Predicted_Prevalence_diff",
"Prevalence_diff", "FP_Rate_diff", "TN_Rate_diff", "Recall_diff",
],
name="race",
categories="all",
plot_type="violinplot",
color_by_group=True,
strict_layout=True,
figsize=(15, 8),
leg_cols=7,
max_cols=4,
)
Output

4. Statistical Significance Testing
To determine whether disparities are statistically significant, EquiBoots provides bootstrap-based hypothesis testing. This involves comparing the distribution of bootstrapped metric differences to a null distribution of no effect.
metrics_boot = [
"Accuracy_diff", "Precision_diff", "Recall_diff", "F1_Score_diff",
"Specificity_diff", "TP_Rate_diff", "FP_Rate_diff", "FN_Rate_diff",
"TN_Rate_diff", "Prevalence_diff", "Predicted_Prevalence_diff",
"ROC_AUC_diff", "Average_Precision_Score_diff", "Log_Loss_diff",
"Brier_Score_diff", "Calibration_AUC_diff"
]
test_config = {
"test_type": "bootstrap_test",
"alpha": 0.05,
"adjust_method": "bonferroni",
"confidence_level": 0.95,
"classification_task": "binary_classification",
"tail_type": "two_tailed",
"metrics": metrics_boot,
}
stat_test_results = eq2.analyze_statistical_significance(
metric_dict=race_metrics,
var_name="race",
test_config=test_config,
differences=diffs,
)
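For intuition, the sketch below shows one way a two-tailed bootstrap test with Bonferroni adjustment can be derived from a vector of bootstrapped differences. It is a conceptual illustration only, not EquiBoots' internal implementation, and the function name and inputs are hypothetical.
import numpy as np

def bootstrap_p_value(diff_samples, alpha=0.05, n_tests=1):
    # diff_samples: bootstrapped metric differences for one group vs. the reference.
    diff_samples = np.asarray(diff_samples)
    # Two-tailed p-value: how often the differences fall on either side of zero.
    p_one_sided = min(np.mean(diff_samples <= 0), np.mean(diff_samples >= 0))
    p_value = min(1.0, 2 * p_one_sided)
    # Bonferroni adjustment: compare against alpha divided by the number of tests.
    significant = p_value < alpha / n_tests
    # 95% percentile confidence interval for the difference.
    ci_lower, ci_upper = np.percentile(diff_samples, [2.5, 97.5])
    return p_value, significant, (ci_lower, ci_upper)

# Hypothetical usage with bootstrapped Accuracy differences for one group:
# p, sig, ci = bootstrap_p_value(accuracy_diffs, alpha=0.05, n_tests=len(metrics_boot))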
4.1: Metrics Table with Significance Annotations
You can summarize bootstrap-based statistical significance using metrics_table():
stat_metrics_table_diff = eqb.metrics_table(
race_metrics,
statistical_tests=stat_test_results,
differences=diffs,
reference_group="White",
)
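If the returned table is a pandas DataFrame (worth verifying for your EquiBoots version), it can be exported for reporting; the file name below is arbitrary:
# Assumption: stat_metrics_table_diff behaves like a pandas DataFrame.
stat_metrics_table_diff.to_csv("race_metric_differences.csv")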
Note
Asterisks (*) indicate significant omnibus test results.
Triangles (▲) indicate significant pairwise differences from the reference group.
Output
| Metric | Black | Asian-Pac-Islander |
|---|---|---|
| Accuracy_diff | 0.070 * | -0.050 |
| Precision_diff | 0.141 * | 0.016 |
| Recall_diff | -0.111 | -0.119 |
| F1_Score_diff | -0.050 | -0.080 |
| Specificity_diff | 0.056 * | -0.002 |
| TP_Rate_diff | -0.111 | -0.119 |
| FP_Rate_diff | -0.056 * | 0.002 |
| FN_Rate_diff | 0.111 | 0.119 |
| TN_Rate_diff | 0.056 * | -0.002 |
| Prevalence_diff | -0.122 * | 0.035 |
| Predicted_Prevalence_diff | -0.133 * | -0.016 |
| ROC_AUC_diff | 0.035 | -0.041 |
| Average_Precision_Score_diff | -0.005 | -0.044 |
| Log_Loss_diff | -0.131 * | 0.113 |
| Brier_Score_diff | -0.043 * | 0.036 |
| Calibration_AUC_diff | 0.148 * | 0.215 * |
4.2: Visualize Differences with Significance
Finally, plot the statistically tested metric differences:
eqb.eq_group_metrics_plot(
group_metrics=diffs,
metric_cols=metrics_boot,
name="race",
categories="all",
figsize=(20, 10),
plot_type="violinplot",
color_by_group=True,
show_grid=True,
max_cols=6,
strict_layout=True,
show_pass_fail=False,
statistical_tests=stat_test_results,
)
Output

Bootstrapped Group Curve Plots
- eq_plot_bootstrapped_group_curves(boot_sliced_data, curve_type='roc', common_grid=np.linspace(0, 1, 100), bar_every=10, n_bins=10, line_kwgs=None, title='Bootstrapped Curve by Group', filename='bootstrapped_curve', save_path=None, figsize=(8, 6), dpi=100, subplots=False, n_cols=2, n_rows=None, group=None, color_by_group=True, exclude_groups=0, show_grid=True, y_lim=None)
Plots bootstrapped ROC, precision-recall, or calibration curves by group. This function takes a list of bootstrapped group-level datasets and computes uncertainty bands for each curve using interpolation over a shared x-axis grid. Results can be rendered in overlay or subplot formats, with optional gridlines and curve-specific annotations (e.g., AUROC, AUCPR, or Brier score).
- Parameters:
boot_sliced_data (list[dict[str, dict[str, np.ndarray]]]) – A list of bootstrap iterations, each mapping group name to ‘y_true’ and ‘y_prob’ arrays.
curve_type (str) – Type of curve to plot: ‘roc’, ‘pr’, or ‘calibration’.
common_grid (np.ndarray) – Shared x-axis points used to interpolate all curves for consistency.
bar_every (int) – Number of points between vertical error bars on the bootstrapped curve.
n_bins (int) – Number of bins for calibration plots.
line_kwgs (dict[str, Any] or None) – Optional style parameters for the diagonal or baseline reference line.
title (str) – Title of the entire plot.
filename (str) – Filename (without extension) used when saving the plot.
save_path (str or None) – Directory path to save the figure. If None, the plot is displayed instead.
figsize (tuple[float, float]) – Size of the figure as a (width, height) tuple in inches.
dpi (int) – Dots-per-inch resolution of the figure.
subplots (bool) – Whether to show each group’s curve in a separate subplot.
n_cols (int) – Number of columns in the subplot grid.
n_rows (int or None) – Number of rows in the subplot grid. Auto-calculated if None.
group (str or None) – Optional name of a single group to plot instead of all groups.
color_by_group (bool) – Whether to assign colors by group identity.
exclude_groups (int | str | list[str] | set[str]) – Groups to exclude from the plot, either by name or by minimum sample size.
show_grid (bool) – Whether to display gridlines on each plot.
y_lim (tuple[float, float] or None) – Optional y-axis limits (min, max) to enforce on the plots.
ROC AUC Curves
The example below shows bootstrapped ROC curves stratified by race group. Each curve reflects the average ROC performance across resampled iterations, with vertical error bars illustrating variability.
By toggling the subplots argument, the visualization can either overlay all group curves on a single axis (subplots=False) or display each group in its own panel (subplots=True), depending on the desired layout.
Example 1 (Overlaid Curves with Error Bars)
eqb.eq_plot_bootstrapped_group_curves(
boot_sliced_data=boots_race_data,
curve_type="roc",
title="Bootstrapped ROC Curve by Race",
bar_every=100,
dpi=100,
n_bins=10,
figsize=(6, 6),
color_by_group=True,
)
Output

This view helps quantify variability in model performance across subpopulations. Overlaying curves in a single plot (subplots=False) makes it easy to compare uncertainty bands side by side. Groups with insufficient data or minimal representation can be excluded using exclude_groups, as shown in the sketch below.
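For example, small groups could be dropped by passing a minimum sample size (the threshold here simply mirrors the group_min_size used at initialization and is illustrative); named groups can be excluded the same way.
# Illustrative: exclude any group with fewer than 150 samples from the plot.
# Named exclusions are also possible, e.g. exclude_groups=["Other"] (hypothetical group).
eqb.eq_plot_bootstrapped_group_curves(
    boot_sliced_data=boots_race_data,
    curve_type="roc",
    title="Bootstrapped ROC Curve by Race",
    bar_every=100,
    exclude_groups=150,
    color_by_group=True,
)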
Example 2 (subplots=True)
eqb.eq_plot_bootstrapped_group_curves(
boot_sliced_data=boots_race_data,
curve_type="roc",
title="Bootstrapped ROC Curve by Race",
bar_every=100,
subplots=True,
dpi=100,
n_bins=10,
figsize=(6, 6),
color_by_group=True,
)
Output

This multi‐panel layout makes side-by-side comparison of each group’s uncertainty bands straightforward.
Precision-Recall Curves
The example below presents bootstrapped precision-recall (PR) curves grouped by race. Each curve illustrates the average precision-recall relationship across bootstrapped samples, with vertical error bars indicating the variability at select recall thresholds.
As with ROC curves, setting subplots=False overlays all groups in a single plot, allowing for compact comparison. Alternatively, setting subplots=True creates individual panels for each group to better visualize variations in precision across recall levels.
eqb.eq_plot_bootstrapped_group_curves(
boot_sliced_data=boots_race_data,
curve_type="pr",
title="Bootstrapped PR Curve by Race",
subplots=True,
bar_every=100,
n_cols=1,
dpi=100,
n_bins=10,
figsize=(6, 6),
color_by_group=True,
)
Output

Subplot mode offers a cleaner side-by-side comparison of each group’s bootstrapped precision-recall behavior, making small differences in model performance easier to interpret.
Calibration Curves
The following example visualizes bootstrapped calibration curves grouped by race. Each curve reflects the average alignment between predicted probabilities and observed outcomes, aggregated over multiple resampled datasets. Vertical bars show variability in the calibration estimate at evenly spaced probability intervals.
As with ROC and PR plots, subplots=False will overlay all group curves on one axis, while subplots=True generates a separate panel for each group.
Example (Calibration Curves with Error Bars, subplots=True)
eqb.eq_plot_bootstrapped_group_curves(
boot_sliced_data=boots_race_data,
curve_type="calibration",
title="Bootstrapped Calibration Curve by Race",
subplots=True,
bar_every=10,
dpi=100,
n_bins=10,
figsize=(6, 6),
color_by_group=True,
)
Output

Using subplots offers a focused view of calibration accuracy for each group, allowing nuanced inspection of where the model’s confidence aligns or diverges from observed outcomes.
Summary
Bootstrapping provides a rigorous and interpretable framework for evaluating fairness by estimating uncertainty in performance metrics, computing disparities, and identifying statistically significant differences between groups.
Use EquiBoots to support robust fairness audits that go beyond simple point comparisons and account for sampling variability and multiple comparisons.