EquiBoots Class

class equiboots.EquiBootsClass.EquiBoots(y_true: array, y_pred: array, fairness_df: DataFrame, fairness_vars: list, y_prob: array | None = None, seeds: list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], reference_groups: list | None = None, task: str = 'binary_classification', bootstrap_flag: bool = False, num_bootstraps: int = 10, boot_sample_size: int = 100, balanced: bool = True, stratify_by_outcome: bool = False, group_min_size: int = 10)

Bases: object

calculate_differences(metric_dict, ref_var_name: str) → dict: Calculate difference metrics for each group based on the task type.

calculate_disparities(metric_dict, var_name: str) → dict: Calculate disparity metrics for each group based on the task type.

calculate_groups_differences(metric_dict: dict, ref_var_name: str) → dict: Calculate differences between each group and the reference group.

calculate_groups_disparities(metric_dict: dict, var_name: str) → dict: Calculate disparities between each group and the reference group.

check_classification_task(task)

check_fairness_vars(fairness_vars)

check_group_empty(sampled_group: array, cat: str, var: str) → bool: Check if sampled group is empty.

check_group_size(group: Index, cat: str, var: str) → bool: Check if a group meets the minimum size requirement.

check_task(task)

get_groups_metrics(sliced_dict: dict, metric_list: List[str] | None = None) → dict: Calculate metrics for each group based on the task type or a custom list.

get_metrics(sliced_dict, metric_list: List[str] | None = None) → dict: Calculate metrics for each group based on the task type.

static list_adjustment_methods() → Dict[str, str]: List available adjustment methods and their descriptions.

static list_available_tests() → Dict[str, str]: List available statistical tests and their descriptions.

set_fix_seeds(seeds: list) → None: Set fixed random seeds for bootstrapping or reproducibility.

set_reference_groups(reference_groups)

Overview

The EquiBoots class provides tools for fairness-aware evaluation and bootstrapping of machine learning model predictions. It supports binary, multi-class, multi-label classification, and regression tasks, and enables group-based metric calculation, disparity analysis, and statistical significance testing.

Constructor

class equiboots.EquiBootsClass.EquiBoots(y_true, y_pred, fairness_df, fairness_vars, y_prob=None, seeds=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], reference_groups=None, task='binary_classification', bootstrap_flag=False, num_bootstraps=10, boot_sample_size=100, balanced=True, stratify_by_outcome=False, group_min_size=10)

Initialize a new EquiBoots instance.

Parameters

y_true (numpy.ndarray) Ground truth labels.
y_pred (numpy.ndarray) Predicted labels.
fairness_df (pandas.DataFrame) DataFrame containing fairness variables.
fairness_vars (list of str) Names of fairness variables.
y_prob (numpy.ndarray, optional) Predicted probabilities.
seeds (list of int, optional) Random seeds for bootstrapping.
reference_groups (list, optional) Reference group for each fairness variable.
task (str) One of binary_classification, multi_class_classification, multi_label_classification, or regression.
bootstrap_flag (bool) Whether to perform bootstrapping.
num_bootstraps (int) Number of bootstrap iterations.
boot_sample_size (int) Size of each bootstrap sample.
balanced (bool) If True, balance samples across groups; otherwise stratify by original proportions.
stratify_by_outcome (bool) Stratify sampling by outcome label.
group_min_size (int) Minimum group size (groups smaller than this are omitted).

Returns

None

Main Methods

equiboots.EquiBootsClass.grouper(groupings_vars)

Groups data by the specified fairness variables and stores category indices.

Parameters

groupings_vars (list of str) Variables to group by.

Returns

None

equiboots.EquiBootsClass.slicer(slicing_var)

Slice y_true, y_prob, and y_pred by a single fairness variable.

Parameters

slicing_var (str) Variable name to slice by.

Returns

dict or list of dict: Sliced outputs.
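
As a rough illustration of what slicing by a fairness variable means, the sketch below splits labels and predictions by group using plain Python. The helper slice_by_variable and the "y_true"/"y_pred" keys are assumptions made for this example; they are not taken from the EquiBoots source, and the real slicer may structure its output differently.

```python
from collections import defaultdict


def slice_by_variable(y_true, y_pred, groups):
    """Hypothetical helper: split labels and predictions by group label."""
    sliced = defaultdict(lambda: {"y_true": [], "y_pred": []})
    for truth, pred, group in zip(y_true, y_pred, groups):
        sliced[group]["y_true"].append(truth)
        sliced[group]["y_pred"].append(pred)
    return dict(sliced)


sliced = slice_by_variable(
    y_true=[0, 1, 1, 0, 1],
    y_pred=[0, 1, 1, 0, 1],
    groups=["A", "B", "A", "B", "A"],
)
print(sliced["A"])  # labels and predictions for the rows in group "A"
```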

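equiboots.EquiBootsClass.get_metrics(sliced_dict, metric_list=None)

Calculate performance metrics for each group.

Parameters

sliced_dict (dict or list of dict) Output of slicer.
metric_list (list of str, optional) Custom list of metrics to calculate. Defaults to None, in which case the metrics are chosen based on the task type.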

Returns

dict or list of dict: Metrics per group.

equiboots.EquiBootsClass.calculate_disparities(metric_dict, var_name)

Compute ratio disparities against the reference group.

Parameters

metric_dict (dict or list of dict) Group metrics.
var_name (str) Fairness variable name.

Returns

dict or list of dict: Ratio disparities.
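
A ratio disparity compares each group's metric with the same metric in the reference group. The sketch below shows that arithmetic on a plain dictionary. The helper ratio_disparities, the metric values, and the choice to return NaN when the reference value is zero are assumptions for illustration, not the library's implementation.

```python
def ratio_disparities(metrics_by_group, reference_group):
    """Hypothetical helper: divide each group's metric by the reference group's."""
    reference = metrics_by_group[reference_group]
    disparities = {}
    for group, metrics in metrics_by_group.items():
        disparities[group] = {
            # Assumption: report NaN when the reference metric is zero.
            name: (value / reference[name] if reference[name] != 0 else float("nan"))
            for name, value in metrics.items()
        }
    return disparities


metrics_by_group = {
    "A": {"Accuracy": 0.80, "Recall": 0.50},
    "B": {"Accuracy": 0.60, "Recall": 0.75},
}
disparities = ratio_disparities(metrics_by_group, reference_group="A")
print(disparities["B"])  # values below 1 mean group B scores lower than the reference
```

Under this definition the reference group always has a ratio of 1 for every metric.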

equiboots.EquiBootsClass.calculate_differences(metric_dict, ref_var_name)

Compute difference disparities against the reference group.

Parameters

metric_dict (dict or list of dict) Group metrics.
ref_var_name (str) Reference group name.

Returns

dict or list of dict: Difference disparities.
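
The "dict or list of dict" return type suggests one dictionary for point estimates and one dictionary per replicate when bootstrapping. The sketch below illustrates that shape for difference disparities (group metric minus reference metric). Both helper names are hypothetical, and the dispatch on list input is an assumption about the behavior, not a copy of the library code.

```python
def metric_differences(metrics_by_group, reference_group):
    """Hypothetical helper: subtract the reference group's metric from each group's."""
    reference = metrics_by_group[reference_group]
    return {
        group: {name: value - reference[name] for name, value in metrics.items()}
        for group, metrics in metrics_by_group.items()
    }


def differences_any(metric_data, reference_group):
    """Accept one dict (point estimate) or a list of dicts (bootstrap replicates)."""
    if isinstance(metric_data, list):
        return [metric_differences(m, reference_group) for m in metric_data]
    return metric_differences(metric_data, reference_group)


point = {"A": {"Recall": 0.5}, "B": {"Recall": 0.75}}
replicates = [point, {"A": {"Recall": 0.5}, "B": {"Recall": 0.25}}]

print(differences_any(point, "A"))
print(differences_any(replicates, "A"))
```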

equiboots.EquiBootsClass.analyze_statistical_significance(metric_dict, var_name, test_config, differences=None)

Perform significance testing on metric differences.

Parameters

metric_dict (dict or list of dict) Group metrics.
var_name (str) Fairness variable name.
test_config (dict) Statistical test configuration.
differences (dict, optional) Precomputed differences.

Returns

dict of dict: Nested results of the form {outer_key: {metric_name: StatTestResult}}, where outer_key is either "omnibus" or a group name, and metric_name is one of the metrics in StatisticalTester.METRIC_LIST.

equiboots.EquiBootsClass.set_fix_seeds(seeds)

Set fixed random seeds for reproducibility.

Parameters

seeds (list of int) Seeds to apply.

Returns

None

equiboots.EquiBootsClass.list_available_tests()

List the available statistical tests.

Returns

dict: Test names and descriptions.

equiboots.EquiBootsClass.list_adjustment_methods()

List the available p-value adjustment methods.

Returns

dict: Adjustment methods.

Non-Main/Internal Methods

equiboots.EquiBootsClass.set_reference_groups(reference_groups)

Set or infer reference groups for fairness variables.

Parameters

reference_groups (list) Reference groups to use.

Returns

None

equiboots.EquiBootsClass.check_task(task)

Validate the task type.

Parameters

task (str) Task name.

Returns

None

equiboots.EquiBootsClass.check_classification_task(task)

Ensure the task is a classification type.

Parameters

task (str) Task name.

Returns

None

equiboots.EquiBootsClass.check_fairness_vars(fairness_vars)

Validate the fairness variables input.

Parameters

fairness_vars (list of str) Variables to validate.

Returns

None

equiboots.EquiBootsClass.check_group_size(group, cat, var)

Verify minimum size for a group.

Parameters

group (Index) Group data.
cat (str) Category name.
var (str) Variable name.

Returns

bool: Whether the group meets the minimum size requirement.

equiboots.EquiBootsClass.check_group_empty(sampled_group, cat, var)

Check if a sampled group is empty.

Parameters

sampled_group (array) Group data.
cat (str) Category name.
var (str) Variable name.

Returns

bool: Whether the sampled group is empty.

equiboots.EquiBootsClass.sample_group(group, n_categories, indx, sample_size, seeds, balanced)

Draw bootstrap or stratified samples.

Parameters

group Group data.
n_categories (int) Number of categories.
indx Indices of data.
sample_size (int) Bootstrap sample size.
seeds (list of int) Random seeds.
balanced (bool) Balance flag.

Returns

The sampled group data.
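
To make the balanced flag concrete, the sketch below draws one bootstrap sample of row indices per group, either with equal group sizes or in proportion to the original group sizes. The helper bootstrap_indices and its allocation rules are assumptions for illustration; the actual sample_group method may allocate and draw samples differently.

```python
import random


def bootstrap_indices(group_indices, sample_size, seed, balanced=True):
    """Hypothetical helper: draw one bootstrap sample of row indices per group."""
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    total = sum(len(idx) for idx in group_indices.values())
    sampled = {}
    for group, idx in group_indices.items():
        if balanced:
            # Assumption: equal share of the sample for every group.
            n = sample_size // len(group_indices)
        else:
            # Assumption: share proportional to the group's original size.
            n = round(sample_size * len(idx) / total)
        sampled[group] = rng.choices(idx, k=n)  # sampling with replacement
    return sampled


group_indices = {"A": [0, 2, 4, 6, 8, 9], "B": [1, 3, 5, 7]}
balanced = bootstrap_indices(group_indices, sample_size=10, seed=1, balanced=True)
proportional = bootstrap_indices(group_indices, sample_size=10, seed=1, balanced=False)
print({g: len(v) for g, v in balanced.items()})      # {'A': 5, 'B': 5}
print({g: len(v) for g, v in proportional.items()})  # {'A': 6, 'B': 4}
```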

equiboots.EquiBootsClass.groups_slicer(groups, slicing_var)

Slice data into categories for a given variable.

Parameters

groups Group index mapping.
slicing_var (str) Variable name.

Returns

dict or list of dict: Sliced data.

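equiboots.EquiBootsClass.get_groups_metrics(sliced_dict, metric_list=None)

Calculate metrics for each group based on the task type or a custom list.

Parameters

sliced_dict (dict or list of dict) Sliced data.
metric_list (list of str, optional) Custom list of metrics to calculate. Defaults to None.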

Returns

dict or list of dict: Metrics per group.

equiboots.EquiBootsClass.calculate_groups_disparities(metric_dict, var_name)

Compute ratio disparities for each group.

Parameters

metric_dict (dict or list of dict) Group metrics.
var_name (str) Fairness variable name.

Returns

dict or list of dict: Ratio disparities.

equiboots.EquiBootsClass.calculate_groups_differences(metric_dict, ref_var_name)

Compute difference disparities for each group.

Parameters

metric_dict (dict or list of dict) Group metrics.
ref_var_name (str) Reference group name.

Returns

dict or list of dict: Difference disparities.

Example Usage

Below are two minimal examples demonstrating how to use the EquiBoots class: one without bootstrapping and one with bootstrapping.

For more detailed examples, refer to the Colab notebook or py_scripts/testingscript.py.

Point Estimates Without Bootstrapping

import numpy as np
import pandas as pd
from equiboots import EquiBoots

# Example data
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.7, 0.4, 0.9])
y_pred = np.array([0, 1, 1, 0, 1])
fairness_df = pd.DataFrame({
    "race": ["A", "B", "A", "B", "A"],
    "sex": ["M", "F", "F", "M", "F"]
})

eq = EquiBoots(
    y_true=y_true,
    y_prob=y_prob,
    y_pred=y_pred,
    fairness_df=fairness_df,
    fairness_vars=["race", "sex"],
    task="binary_classification",
    bootstrap_flag=False
)

eq.grouper(groupings_vars=["race"])
sliced = eq.slicer("race")
metrics = eq.get_metrics(sliced)
disparities = eq.calculate_disparities(metrics, "race")

print("Metrics by group:", metrics)
print("Disparities:", disparities)

With Bootstrapping

import numpy as np
import pandas as pd
from equiboots import EquiBoots

# Example data
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.7, 0.4, 0.9])
y_pred = np.array([0, 1, 1, 0, 1])
fairness_df = pd.DataFrame({
    "race": ["A", "B", "A", "B", "A"],
    "sex": ["M", "F", "F", "M", "F"]
})

eq = EquiBoots(
    y_true=y_true,
    y_prob=y_prob,
    y_pred=y_pred,
    fairness_df=fairness_df,
    fairness_vars=["race", "sex"],
    task="binary_classification",
    bootstrap_flag=True,
    num_bootstraps=5,
    boot_sample_size=5
)

eq.grouper(groupings_vars=["race"])
sliced = eq.slicer("race")
metrics = eq.get_metrics(sliced)
disparities = eq.calculate_disparities(metrics, "race")

print("Metrics by group (bootstrapped):", metrics)
print("Disparities (bootstrapped):", disparities)

StatisticalTester

Module: equiboots.StatisticalTester

Overview

This module provides statistical significance testing utilities for fairness audits, including bootstrapped tests and per-metric chi-square tests with support for multiple comparison corrections and effect size calculations.

For chi-square testing on binary classification metrics, each metric in METRIC_LIST is tested using a metric-specific 2-cell contingency table built by get_contingency_table (for example, Recall uses [TP, FN] and Precision uses [TP, FP]). When expected cell counts are small, Cochran’s rule is applied: if more than 20% of expected cells are below 5, the test falls back to Fisher’s exact test for 2x2 tables, or emits a warning for K x 2 tables. See Kim HY (2017), https://pmc.ncbi.nlm.nih.gov/articles/PMC5426219/.
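
The Cochran's rule check described above can be sketched in plain Python. The helper names expected_counts and violates_cochran are hypothetical, and the tables are made-up counts; the sketch only illustrates the "more than 20% of expected cells below 5" criterion, not the library's internal code.

```python
def expected_counts(table):
    """Expected cell counts under independence for a K x 2 table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def violates_cochran(table, min_expected=5, max_fraction=0.20):
    """True if more than 20% of expected cells fall below 5."""
    cells = [e for row in expected_counts(table) for e in row]
    small = sum(1 for e in cells if e < min_expected)
    return small / len(cells) > max_fraction


# Recall slice [TP, FN], one row per group (illustrative counts).
small_table = [[10, 2], [8, 4]]
large_table = [[30, 20], [25, 25]]

print(expected_counts(small_table))   # [[9.0, 3.0], [9.0, 3.0]]
print(violates_cochran(small_table))  # True: 2 of 4 expected cells are below 5
print(violates_cochran(large_table))  # False
```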

Results are returned as a nested dictionary of the form {outer_key: {metric_name: StatTestResult}}, where outer_key is either "omnibus" or a non-reference group name.

Classes

StatTestResult

class equiboots.StatisticalTester.StatTestResult(statistic: float, p_value: float, is_significant: bool, test_name: str, critical_value: float | None = None, effect_size: float | None = None, confidence_interval: Tuple[float, float] | None = None)

Bases: object

Stores statistical test results including test statistic, p-value, and significance.

StatisticalTester

class equiboots.StatisticalTester.StatisticalTester

Bases: object

Performs statistical significance testing on metrics with support for various tests and data types.

adjusting_p_vals(config, results): Run the p-value adjustment method based on bootstrap conditions.

analyze_metrics(metrics_data: Dict | List[Dict], reference_group: str, test_config: Dict[str, Any], task: str | None = None, differences: dict | None = None) → Dict[str, Dict[str, StatTestResult]]: Analyze metrics for statistical significance against a reference group.

calc_p_value_bootstrap(data: list, config: dict) → float: Calculate the p-value from the data and config.

cohens_d(data_1, data_2): Calculate Cohen’s d.

get_ci_bounds(config: dict) → tuple: Get confidence interval bounds based on tail type.

Class Attributes

equiboots.StatisticalTester.StatisticalTester.METRIC_LIST

List of binary-classification metrics tested by _chi_square_test. Each metric is tested via a metric-specific 2-cell contingency table built by get_contingency_table:

Recall ([TP, FN])
Precision ([TP, FP])
Accuracy ([TP+TN, FP+FN])
F1 Score ([TP, FP+FN])
Specificity ([TN, FP])
FP Rate ([FP, TN])
FN Rate ([FN, TP])
Predicted Prevalence ([TP+FP, TN+FN])
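
The mapping listed above can be written out as a small lookup from confusion-matrix counts to the two cells for each metric. This is a sketch only: two_cell_counts is a hypothetical helper that works on a plain dict, whereas get_contingency_table is documented as taking a DataFrame with one row per group.

```python
def two_cell_counts(counts, metric):
    """Hypothetical helper: return the two cells listed above for one group."""
    tp, fp, tn, fn = counts["TP"], counts["FP"], counts["TN"], counts["FN"]
    cells = {
        "Recall": (tp, fn),
        "Precision": (tp, fp),
        "Accuracy": (tp + tn, fp + fn),
        "F1 Score": (tp, fp + fn),
        "Specificity": (tn, fp),
        "FP Rate": (fp, tn),
        "FN Rate": (fn, tp),
        "Predicted Prevalence": (tp + fp, tn + fn),
    }
    return cells[metric]


group1 = {"TP": 10, "FP": 5, "TN": 20, "FN": 2}
print(two_cell_counts(group1, "Recall"))                # (10, 2)
print(two_cell_counts(group1, "Accuracy"))              # (30, 7)
print(two_cell_counts(group1, "Predicted Prevalence"))  # (15, 22)
```

Stacking one such pair per group gives the K x 2 table that the chi-square test operates on.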

Function Signatures

class equiboots.StatisticalTester.StatTestResult(statistic: float, p_value: float, is_significant: bool, test_name: str, critical_value: float | None = None, effect_size: float | None = None, confidence_interval: Tuple[float, float] | None = None): Stores statistical test results including test statistic, p-value, and significance.

class equiboots.StatisticalTester.StatisticalTester

Performs statistical significance testing on metrics with support for various tests and data types.

_bootstrap_test(data: List[float], config: dict) → StatTestResult

get_ci_bounds(config: dict) → tuple

calc_p_value_bootstrap(data: list, config: dict) → float

_chi_square_test(metrics: Dict[str, Any], config: Dict[str, Any]) → Dict[str, StatTestResult]: Runs one chi-square test per metric in METRIC_LIST, using a metric-specific 2-cell contingency table. Returns a dictionary keyed by metric name. Falls back to Fisher’s exact test when Cochran’s rule is violated on a 2x2 table.

get_contingency_table(data: pd.DataFrame, metric: str) → pd.DataFrame: Build the metric-specific 2-cell contingency table from a DataFrame of per-group confusion matrix counts. data is expected to have one row per group and columns TP, FP, TN, FN. metric must be one of the entries in METRIC_LIST.

_calculate_effect_size(metrics: Dict, metric: str) → float: Compute Cramér’s V on the same K x 2 contingency slice used by the chi-square test for the given metric.
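
For reference, Cramér's V on a K x 2 table can be computed from the chi-square statistic as sqrt(chi2 / (n * min(rows - 1, cols - 1))). The sketch below uses the standard textbook formula in plain Python; cramers_v is a hypothetical helper, and skipping cells with zero expected count is an assumption made here to avoid division by zero.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # assumption: skip cells with zero expected count
                chi2 += (observed - expected) ** 2 / expected
    k = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(chi2 / (n * k))


# Recall slice [TP, FN], one row per group (illustrative counts).
print(round(cramers_v([[10, 2], [8, 4]]), 4))  # 0.1925
print(cramers_v([[10, 0], [0, 10]]))           # 1.0 for a perfectly separated table
```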

_adjust_p_values(results: Dict[str, Dict[str, StatTestResult]], method: str, alpha: float, boot: bool = False) → Dict[str, Dict[str, StatTestResult]]: Adjust p-values for multiple comparisons. For non-bootstrapped results, adjustment is applied per metric across pairwise (non-omnibus) groups only.
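
The "per metric across pairwise groups only" rule can be illustrated with a Bonferroni correction on plain p-values. In this sketch, bonferroni_per_metric is a hypothetical helper that works on a nested dict of floats rather than StatTestResult objects, and the p-values are made up.

```python
def bonferroni_per_metric(p_values):
    """p_values: {outer_key: {metric: p}}; "omnibus" entries are left unadjusted."""
    pairwise = {k: v for k, v in p_values.items() if k != "omnibus"}
    metrics = {m for v in pairwise.values() for m in v}
    adjusted = {k: dict(v) for k, v in p_values.items()}
    for metric in metrics:
        groups = [g for g in pairwise if metric in pairwise[g]]
        n_tests = len(groups)  # number of pairwise comparisons for this metric
        for g in groups:
            adjusted[g][metric] = min(pairwise[g][metric] * n_tests, 1.0)
    return adjusted


p_values = {
    "omnibus": {"Recall": 0.01},
    "B": {"Recall": 0.02},
    "C": {"Recall": 0.6},
}
adjusted = bonferroni_per_metric(p_values)
print(adjusted)
# {'omnibus': {'Recall': 0.01}, 'B': {'Recall': 0.04}, 'C': {'Recall': 1.0}}
```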

analyze_metrics(metrics_data: Dict | List[Dict], reference_group: str, test_config: Dict[str, Any], task: str | None = None, differences: dict | None = None) → Dict[str, Dict[str, StatTestResult]]

adjusting_p_vals(config, results)

_validate_config(config: Dict[str, Any])

cohens_d(data_1, data_2): Standardized mean difference: (mean(data_1) - mean(data_2)) / pooled_std. Returns 0 when the pooled standard deviation is 0.
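
The formula above can be sketched with the standard library. The pooled standard deviation used here is the common sample-variance-weighted form, which is an assumption: the source does not state which pooling the library uses. The name cohens_d_sketch is chosen to keep it distinct from the library method.

```python
import math
import statistics


def cohens_d_sketch(data_1, data_2):
    """Standardized mean difference with a pooled standard deviation (assumed form)."""
    n1, n2 = len(data_1), len(data_2)
    var1 = statistics.variance(data_1)  # sample variance, needs at least 2 values
    var2 = statistics.variance(data_2)
    pooled_std = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    if pooled_std == 0:
        return 0.0  # mirrors the documented behavior for zero pooled std
    return (statistics.mean(data_1) - statistics.mean(data_2)) / pooled_std


print(cohens_d_sketch([2, 4, 6], [1, 3, 5]))  # 0.5
print(cohens_d_sketch([1, 1, 1], [1, 1, 1]))  # 0.0
```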

_analyze_single_metrics(metrics: Dict, reference_group: str, config: Dict[str, Any]) → Dict[str, Dict[str, StatTestResult]]: Three-phase pipeline for non-bootstrapped metrics: (1) run the omnibus chi-square across all groups, (2) for each non-reference group, run a pairwise comparison against the reference (gated on any omnibus significance), (3) annotate effect sizes per (group, metric) where both the omnibus and the pairwise test are significant.

_analyze_bootstrapped_metrics(metrics_diff: list[Dict], reference_group: str, config: Dict[str, Any]) → Dict[str, Dict[str, StatTestResult]]

Usage Example

from equiboots.StatisticalTester import StatisticalTester

tester = StatisticalTester()

config = {
    "test_type": "chi_square",
    "alpha": 0.05,
    "adjust_method": "bonferroni",
}

metrics = {
    "group1": {"TP": 10, "FP": 5, "TN": 20, "FN": 2},
    "group2": {"TP": 8, "FP": 7, "TN": 18, "FN": 4},
}

results = tester.analyze_metrics(
    metrics,
    reference_group="group1",
    test_config=config,
    task="binary_classification",
)

# Results are nested: {outer_key: {metric_name: StatTestResult}}
# outer_key is either "omnibus" or a non-reference group name.
for outer_key, metric_results in results.items():
    for metric, result in metric_results.items():
        print(
            f"{outer_key} / {metric}: "
            f"p-value={result.p_value:.4f}, "
            f"significant={result.is_significant}, "
            f"test={result.test_name}"
        )