Bootstrap Estimate Evaluation
While point estimates provide a snapshot of model performance for each subgroup, they do not capture uncertainty or statistical variability. Bootstrap estimates enhance fairness auditing by enabling confidence interval computation, statistical significance testing, and disparity analysis through repeated resampling.
EquiBoots supports bootstrap-based estimation for both classification and regression tasks. This section walks through the process of generating bootstrapped group metrics, computing disparities, and performing statistical tests to assess whether observed differences are statistically significant.
1. Bootstrap Setup
Step 1.1: Instantiate EquiBoots with Bootstrapping
To begin, we instantiate the EquiBoots class with the required inputs: the true outcome labels (y_test), predicted class labels (y_pred), predicted probabilities (y_prob), and a DataFrame that holds sensitive attributes like race or sex.
Note
y_pred, y_prob, and y_test are defined in the modeling generation section.
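If you are following along outside that section, a minimal sketch of how these arrays are typically produced from a fitted binary classifier is shown below; the model, train/test split, and variable names are illustrative, not the guide's actual modeling code.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Illustrative sketch only: X and y are your feature matrix and binary outcome;
# fairness_df should hold the sensitive attributes for the same test rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)               # predicted class labels
y_prob = model.predict_proba(X_test)[:, 1]   # predicted probabilities (positive class)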
Note
For recommended behaviour, use more than 5,000 bootstrap iterations. This ensures that the normal sampling distribution of each metric is properly estimated.
Bootstrapping is enabled by passing the required arguments during initialization. You must specify:
- A list of random seeds for reproducibility
- bootstrap_flag=True to enable resampling
- The number of bootstrap iterations (num_bootstraps)
- The sample size for each bootstrap (boot_sample_size)
- Optional settings for stratification and balancing
import numpy as np
import equiboots as eqb
int_list = np.linspace(0, 100, num=10, dtype=int).tolist()
eq2 = eqb.EquiBoots(
y_true=y_test,
y_pred=y_pred,
y_prob=y_prob,
fairness_df=fairness_df,
fairness_vars=["race"],
seeds=int_list,
reference_groups=["White"],
task="binary_classification",
bootstrap_flag=True,
num_bootstraps=5001,
boot_sample_size=1000,
group_min_size=150,
balanced=False,
stratify_by_outcome=False,
)
Step 1.2: Slice by Group and Compute Metrics
Once initialized, use the grouper() and slicer() methods to prepare bootstrapped samples for each subgroup:
eq2.grouper(groupings_vars=["race"])
boots_race_data = eq2.slicer("race")
Once the data has been sliced, compute the metrics for each bootstrap; this returns a list of metric dictionaries, one per bootstrap iteration. This may take some time to run.
race_metrics = eq2.get_metrics(boots_race_data)
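Before moving on, it can help to sanity-check the result. The snippet below assumes race_metrics follows the List[Dict[str, Dict[str, float]]] layout described later in this guide, one dictionary per bootstrap iteration mapping each group to its metric values.
# Quick structural check of the bootstrapped metrics (layout assumed as above)
print(len(race_metrics))                # number of bootstrap iterations, e.g. 5001
print(race_metrics[0].keys())           # group names, e.g. 'White', 'Black', ...
print(race_metrics[0]["White"].keys())  # metric names computed for each group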
2. Disparity Analysis
Disparities quantify how model performance varies across subgroups relative to a reference group. Here we look at the disparity ratio.
In the context of bias and fairness in machine learning, disparity refers to the differences in model performance, predictions, or outcomes across different demographic or sensitive groups.
It quantifies how a model’s behavior varies for subgroups based on attributes like race, sex, age, or other characteristics.
The disparity ratio for a given metric \(M\) and a specific group \(G\) compared to a reference group \(R\) is:
\[ \text{Disparity Ratio} = \frac{M(G)}{M(R)} \]
For example, if you are looking at the “Predicted Prevalence” metric (the proportion of individuals predicted to have a positive outcome), the Predicted Prevalence Disparity Ratio for a group (e.g., “Black”) compared to a reference group (e.g., “White”) would be:
\[ \frac{\text{Predicted Prevalence}(\text{Black})}{\text{Predicted Prevalence}(\text{White})} \]
dispa = eq2.calculate_disparities(race_metrics, "race")
Plot Disparity Ratios
- eq_group_metrics_plot(group_metrics, metric_cols, name, plot_type='violinplot', categories='all', include_legend=True, cmap='tab20c', color_by_group=True, save_path=None, filename='Disparity_Metrics', max_cols=None, strict_layout=True, figsize=None, show_grid=True, plot_thresholds=(0.0, 2.0), show_pass_fail=False, leg_cols=6, y_lim=None, statistical_tests=None, **plot_kwargs)
Plot group and disparity metrics using Seaborn (violinplot, boxplot, etc.) with optional pass/fail coloring, statistical annotations, and layout customization.
- Parameters:
group_metrics (List[Dict[str, Dict[str, float]]]) – A list of dictionaries, each mapping group/category names to metric values.
metric_cols (List[str]) – List of metric column names to plot (e.g., accuracy, precision).
name (str) – Title prefix or identifier to use in subplot titles.
plot_type (str) – Seaborn plot type to use (e.g., 'violinplot', 'boxplot').
categories (str or List[str]) – Which categories/groups to include; 'all' or a list of category names.
include_legend (bool) – Whether to include a legend in the plot.
cmap (str) – Colormap name for group coloring.
color_by_group (bool) – Whether to assign separate colors by group.
save_path (str or None) – Directory path to save the figure (if not None).
filename (str) – Filename to use for saving the plot (excluding file extension).
max_cols (int or None) – Maximum number of subplot columns. Rows are inferred.
strict_layout (bool) – Whether to apply tight layout for better spacing.
figsize (Tuple[float, float] or None) – Optional figure size as (width, height) in inches.
show_grid (bool) – Whether to display gridlines on each subplot.
plot_thresholds (Tuple[float, float]) – Tuple indicating the (lower, upper) thresholds for pass/fail coloring.
show_pass_fail (bool) – Whether to color tick labels based on threshold pass/fail status.
leg_cols (int) – Number of columns to use in the legend layout.
y_lim (Tuple[float, float] or None) – Y-axis limits for all subplots. If None, determined automatically.
statistical_tests (dict) – Dictionary of test results, used to flag significant differences.
plot_kwargs (dict) – Additional keyword arguments to pass to the Seaborn plotting function.
- Returns:
None. The plot is either displayed or saved to disk.
- Return type:
None
Note
The plot_type parameter is flexible and accepts any valid Seaborn categorical plot function such as 'violinplot', 'boxplot', or 'stripplot'.
eqb.eq_group_metrics_plot(
group_metrics=dispa,
metric_cols=[
"Accuracy_Ratio", "Precision_Ratio", "Predicted_Prevalence_Ratio",
"Prevalence_Ratio", "FP_Rate_Ratio", "TN_Rate_Ratio", "Recall_Ratio",
],
name="race",
categories="all",
plot_type="violinplot",
color_by_group=True,
strict_layout=True,
figsize=(15, 8),
leg_cols=7,
max_cols=4,
)
Output

3. Metric Differences
EquiBoots also enables the user to look at metric differences: the difference between the model's performance for one group and for the reference group.
The disparity difference between a group and the reference is:
\[ \text{Disparity Difference} = M(G) - M(R) \]
Where:
\(M(G)\) is the value of the metric for group \(G\).
\(M(R)\) is the value of the metric for the reference group \(R\).
For example, if you are looking at the “Predicted Prevalence” metric differences, the Predicted Prevalence Disparity Difference for a group (e.g., “Black”) compared to a reference group (e.g., “White”) would be:
\[ \text{Predicted Prevalence}(\text{Black}) - \text{Predicted Prevalence}(\text{White}) \]
diffs = eq2.calculate_differences(race_metrics, "race")
eqb.eq_group_metrics_plot(
group_metrics=diffs,
metric_cols=[
"Accuracy_diff", "Precision_diff", "Predicted_Prevalence_diff",
"Prevalence_diff", "FP_Rate_diff", "TN_Rate_diff", "Recall_diff",
],
name="race",
categories="all",
plot_type="violinplot",
color_by_group=True,
strict_layout=True,
figsize=(15, 8),
leg_cols=7,
max_cols=4,
)
Output

4. Statistical Significance Testing
To determine whether disparities are statistically significant, EquiBoots provides bootstrap-based hypothesis testing. This involves comparing the distribution of bootstrapped metric differences to a null distribution of no effect.
metrics_boot = [
"Accuracy_diff", "Precision_diff", "Recall_diff", "F1_Score_diff",
"Specificity_diff", "TP_Rate_diff", "FP_Rate_diff", "FN_Rate_diff",
"TN_Rate_diff", "Prevalence_diff", "Predicted_Prevalence_diff",
"ROC_AUC_diff", "Average_Precision_Score_diff", "Log_Loss_diff",
"Brier_Score_diff", "Calibration_AUC_diff"
]
test_config = {
"test_type": "bootstrap_test",
"alpha": 0.05,
"adjust_method": "bonferroni",
"confidence_level": 0.95,
"classification_task": "binary_classification",
"tail_type": "two_tailed",
"metrics": metrics_boot,
}
stat_test_results = eq2.analyze_statistical_significance(
metric_dict=race_metrics,
var_name="race",
test_config=test_config,
differences=diffs,
)
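To inspect the raw test results before building tables or plots, you can iterate over them directly. This sketch assumes the structure described for the statistical_tests parameter later in this guide: group names mapped to per-metric results, each exposing an is_significant attribute.
# Hedged sketch: list the metric differences flagged as significant.
# Assumes stat_test_results maps group names to {"<Metric>_diff": result},
# where each result object exposes `.is_significant`.
for group, tests in stat_test_results.items():
    for metric_name, result in tests.items():
        if getattr(result, "is_significant", False):
            print(f"{group}: {metric_name} differs significantly from the reference group")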
4.1: Metrics Table with Significance Annotations
You can summarize bootstrap-based statistical significance using metrics_table():
stat_metrics_table_diff = eqb.metrics_table(
race_metrics,
statistical_tests=stat_test_results,
differences=diffs,
reference_group="White",
)
Note
Asterisks (*) indicate significant omnibus test results.
Triangles (▲) indicate significant pairwise differences from the reference group.
Output
Metric | Black | Asian-Pac-Islander |
---|---|---|
Accuracy_diff | 0.070 * | -0.050 |
Precision_diff | 0.141 * | 0.016 |
Recall_diff | -0.111 | -0.119 |
F1_Score_diff | -0.050 | -0.080 |
Specificity_diff | 0.056 * | -0.002 |
TP_Rate_diff | -0.111 | -0.119 |
FP_Rate_diff | -0.056 * | 0.002 |
FN_Rate_diff | 0.111 | 0.119 |
TN_Rate_diff | 0.056 * | -0.002 |
Prevalence_diff | -0.122 * | 0.035 |
Predicted_Prevalence_diff | -0.133 * | -0.016 |
ROC_AUC_diff | 0.035 | -0.041 |
Average_Precision_Score_diff | -0.005 | -0.044 |
Log_Loss_diff | -0.131 * | 0.113 |
Brier_Score_diff | -0.043 * | 0.036 |
Calibration_AUC_diff | 0.148 * | 0.215 * |
4.2: Visualize Differences with Significance
Finally, plot the statistically tested metric differences:
eqb.eq_group_metrics_plot(
group_metrics=diffs,
metric_cols=metrics_boot,
name="race",
categories="all",
figsize=(20, 10),
plot_type="violinplot",
color_by_group=True,
show_grid=True,
max_cols=6,
strict_layout=True,
show_pass_fail=False,
statistical_tests=stat_test_results,
)
Output

Bootstrapped Group Curve Plots
- eq_plot_bootstrapped_group_curves(boot_sliced_data, curve_type='roc', common_grid=np.linspace(0, 1, 100), bar_every=10, n_bins=10, line_kwgs=None, title='Bootstrapped Curve by Group', filename='bootstrapped_curve', save_path=None, figsize=(8, 6), dpi=100, subplots=False, n_cols=2, n_rows=None, group=None, color_by_group=True, exclude_groups=0, show_grid=True, y_lim=None)
Plots bootstrapped ROC, precision-recall, or calibration curves by group. This function takes a list of bootstrapped group-level datasets and computes uncertainty bands for each curve using interpolation over a shared x-axis grid. Results can be rendered in overlay or subplot formats, with optional gridlines and curve-specific annotations (e.g., AUROC, AUCPR, or Brier score).
- Parameters:
boot_sliced_data (list[dict[str, dict[str, np.ndarray]]]) – A list of bootstrap iterations, each mapping group name to y_true and y_prob arrays.
curve_type (str) – Type of curve to plot: roc, pr, or calibration.
common_grid (np.ndarray) – Shared x-axis points used to interpolate all curves for consistency.
bar_every (int) – Number of points between vertical error bars on the bootstrapped curve.
n_bins (int) – Number of bins for calibration plots.
line_kwgs (dict[str, Any] or None) – Optional style parameters for the diagonal or baseline reference line.
title (str) – Title of the entire plot.
filename (str) – Filename (without extension) used when saving the plot.
save_path (str or None) – Directory path to save the figure. If None, the plot is displayed instead.
figsize (tuple[float, float]) – Size of the figure as a (width, height) tuple in inches.
dpi (int) – Dots-per-inch resolution of the figure.
subplots (bool) – Whether to show each group’s curve in a separate subplot.
n_cols (int) – Number of columns in the subplot grid.
n_rows (int or None) – Number of rows in the subplot grid. Auto-calculated if None.
group (str or None) – Optional name of a single group to plot instead of all groups.
color_by_group (bool) – Whether to assign colors by group identity.
exclude_groups (int | str | list[str] | set[str]) – Groups to exclude from the plot, either by name or by minimum sample size.
show_grid (bool) – Whether to display gridlines on each plot.
y_lim (tuple[float, float] or None) – Optional y-axis limits (min, max) to enforce on the plots.
ROC AUC Curves
The example below shows bootstrapped ROC curves stratified by race group. Each curve reflects the average ROC performance across resampled iterations, with vertical error bars illustrating variability.
By toggling the subplots argument, the visualization can either overlay all group curves on a single axis (subplots=False) or display each group in its own panel (subplots=True), depending on the desired layout.
Example 1 (Overlayed Curves with Error Bars)
eqb.eq_plot_bootstrapped_group_curves(
boot_sliced_data=boots_race_data,
curve_type="roc",
title="Bootstrapped ROC Curve by Race",
bar_every=100,
dpi=100,
n_bins=10,
figsize=(6, 6),
color_by_group=True,
)
Output

This view helps quantify variability in model performance across subpopulations.
Overlaying curves in a single plot (subplots=False) makes it easy to compare uncertainty bands side by side. Groups with insufficient data or minimal representation can be excluded using exclude_groups, as illustrated below.
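For instance, based on the exclude_groups parameter documented above, small or unwanted groups can be dropped either by name or by a minimum sample size; the group name below is illustrative.
eqb.eq_plot_bootstrapped_group_curves(
    boot_sliced_data=boots_race_data,
    curve_type="roc",
    title="Bootstrapped ROC Curve by Race",
    bar_every=100,
    color_by_group=True,
    exclude_groups=["Other"],  # or an int to drop groups below a minimum sample size
)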
Example 2 (subplots=True)
eqb.eq_plot_bootstrapped_group_curves(
boot_sliced_data=boots_race_data,
curve_type="roc",
title="Bootstrapped ROC Curve by Race",
bar_every=100,
subplots=True,
dpi=100,
n_bins=10,
figsize=(6, 6),
color_by_group=True,
)
Output

This multi‐panel layout makes side-by-side comparison of each group’s uncertainty bands straightforward.
Precision-Recall Curves
The example below presents bootstrapped precision-recall (PR) curves grouped by race. Each curve illustrates the average precision-recall relationship across bootstrapped samples, with vertical error bars indicating the variability at select recall thresholds.
As with ROC curves, setting subplots=False overlays all groups in a single plot, allowing for compact comparison. Alternatively, setting subplots=True creates individual panels for each group to better visualize variations in precision across recall levels.
eqb.eq_plot_bootstrapped_group_curves(
boot_sliced_data=boots_race_data,
curve_type="pr",
title="Bootstrapped PR Curve by Race",
subplots=True,
bar_every=100,
n_cols=1,
dpi=100,
n_bins=10,
figsize=(6, 6),
color_by_group=True,
)
Output

Subplot mode offers a cleaner side-by-side comparison of each group’s bootstrapped precision-recall behavior, making small differences in model performance easier to interpret.
Calibration Curves
The following example visualizes bootstrapped calibration curves grouped by race. Each curve reflects the average alignment between predicted probabilities and observed outcomes, aggregated over multiple resampled datasets. Vertical bars show variability in the calibration estimate at evenly spaced probability intervals.
As with ROC and PR plots, subplots=False will overlay all group curves on one axis, while subplots=True generates a separate panel for each group.
Example (Calibration Curves with Subplots and Error Bars)
eqb.eq_plot_bootstrapped_group_curves(
boot_sliced_data=boots_race_data,
curve_type="calibration",
title="Bootstrapped Calibration Curve by Race",
subplots=True,
bar_every=10,
dpi=100,
n_bins=10,
figsize=(6, 6),
color_by_group=True,
)
Output

Using subplots offers a focused view of calibration accuracy for each group, allowing nuanced inspection of where the model’s confidence aligns or diverges from observed outcomes.
Bootstrapped Forest Plots
Use a bootstrapped forest plot to visualize groupwise point estimates with 95% confidence intervals. This shows the mean of a chosen metric for each subgroup with error bars and optional significance flags. It pairs naturally with subgroup metrics produced in Step 2: Slice Groups and Compute Point Estimates.
- eq_plot_bootstrap_forest(group_boot_metrics, metric='Accuracy', reference_group=None, figsize=(6, 4), save_path=None, filename='bootstrap_forest', title=None, statistical_tests=None)
Create a forest plot of a bootstrap metric with 95% CI for each group. If a reference_group is provided, a vertical dotted line is drawn through its mean. Asterisks are added to group labels when significance tests indicate a difference.
- Parameters:
group_boot_metrics (list[dict[str, dict[str, numpy.ndarray]]]) – List of bootstrap samples. Each sample is a dict mapping group names to dicts of metric values. Expected shape: List[Dict[str, Dict[str, np.ndarray]]].
metric (str) – The metric to summarize and plot, for example 'Accuracy' or 'ROC AUC'.
reference_group (str or None) – Group name used for the vertical reference line. If None, no line is drawn.
figsize (tuple[float, float]) – Figure size as (width, height).
save_path (str or None) – Directory to save the plot. If None, the plot is shown.
filename (str) – Filename stem for saving the plot, without extension.
title (str or None) – Optional plot title. If None, a default is generated.
statistical_tests (dict or None) – Optional mapping with significance results per group. Keys are group names. Values are dicts keyed by f"{metric}_diff" that hold an object with .is_significant. Used to annotate group labels with an asterisk.
Example
eqb.eq_plot_bootstrap_forest(
group_boot_metrics=boots_race_data,
metric="ROC AUC",
reference_group="White",
title="AUROC - Bootstrapped Race Metrics",
figsize=(8, 6),
)
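If you have already run analyze_statistical_significance, the statistical_tests parameter documented above can be used to annotate significant groups with an asterisk. The call below is a sketch of that pairing and assumes stat_test_results contains an entry for the chosen metric.
eqb.eq_plot_bootstrap_forest(
    group_boot_metrics=boots_race_data,
    metric="ROC AUC",
    reference_group="White",
    statistical_tests=stat_test_results,  # adds '*' to groups flagged as significant
    title="AUROC - Bootstrapped Race Metrics",
    figsize=(8, 6),
)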
Important
metric can be any metric produced in your pipeline, for example:
'Accuracy', 'Precision', 'Recall', 'F1 Score', 'Specificity',
'TP Rate', 'FP Rate', 'FN Rate', 'TN Rate', 'TP', 'FP',
'FN', 'TN', 'Prevalence', 'Predicted Prevalence', 'ROC AUC',
'Average Precision Score', 'Log Loss', 'Brier Score', 'Calibration AUC'.
These come from the same subgroup computations referenced in Step 2.
Output
Tabular Breakdown of Bootstrapped Metrics
While bootstrapped forest plots provide a visual summary of uncertainty across groups, the same information can also be broken down in tabular form. The helper function calculate_bootstrap_stats takes the list of bootstrapped samples and computes summary statistics for each group and metric.
- calculate_bootstrap_stats(group_boot_metrics, metric)
Calculate mean, standard deviation, and 95% confidence intervals for a given metric across all groups and bootstrap samples. Returns a tidy DataFrame suitable for inspection or export.
- Parameters:
group_boot_metrics (list[dict[str, dict[str, numpy.ndarray]]]) – List of bootstrap samples, each mapping group names to dicts of metric values.
metric (str) – The metric to summarize, for example 'ROC AUC'.
- Returns:
DataFrame with columns:
group: Group name
mean: Average bootstrapped metric value
ci_lower: 2.5th percentile (lower bound of 95% CI)
ci_upper: 97.5th percentile (upper bound of 95% CI)
std: Standard deviation of bootstrapped samples
n_samples: Number of bootstrap samples used
- Return type:
pandas.DataFrame
Example
eqb.calculate_bootstrap_stats(
group_boot_metrics=boots_race_data,
metric="ROC AUC"
)
Output (sample)
group | mean | ci_lower | ci_upper | std | n_samples |
---|---|---|---|---|---|
White | 0.915083 | 0.908539 | 0.921298 | 0.003250 | 5001 |
Black | 0.956088 | 0.936282 | 0.972532 | 0.009244 | 5001 |
Asian-Pac-Islander | 0.910970 | 0.872990 | 0.944444 | 0.018353 | 5001 |
Note
These tabular summaries complement the forest plots:
The plot highlights differences visually, with confidence intervals.
The table provides exact numerical values, which are useful for reporting in papers, dashboards, or statistical summaries.
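Because calculate_bootstrap_stats returns a standard pandas DataFrame, the summary can be exported directly for reporting; the output path below is illustrative.
stats_df = eqb.calculate_bootstrap_stats(
    group_boot_metrics=boots_race_data,
    metric="ROC AUC",
)
# Save the per-group summary for a report or dashboard
stats_df.to_csv("bootstrap_roc_auc_by_race.csv", index=False)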
Summary
Bootstrapping provides a rigorous and interpretable framework for evaluating fairness by estimating uncertainty in performance metrics, computing disparities, and identifying statistically significant differences between groups.
Use EquiBoots to support robust fairness audits that go beyond simple point comparisons and account for sampling variability and multiple comparisons.