EquiBoots Class
- class equiboots.EquiBootsClass.EquiBoots(y_true: array, y_pred: array, fairness_df: DataFrame, fairness_vars: list, y_prob: array | None = None, seeds: list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], reference_groups: list | None = None, task: str = 'binary_classification', bootstrap_flag: bool = False, num_bootstraps: int = 10, boot_sample_size: int = 100, balanced: bool = True, stratify_by_outcome: bool = False, group_min_size: int = 10)[source]
Bases: object
- calculate_differences(metric_dict, ref_var_name: str) dict[source]
Calculate difference metrics for each group based on the task type.
- calculate_disparities(metric_dict, var_name: str) dict[source]
Calculate disparity metrics for each group based on the task type.
- calculate_groups_differences(metric_dict: dict, ref_var_name: str) dict[source]
Calculate differences between each group and the reference group.
- calculate_groups_disparities(metric_dict: dict, var_name: str) dict[source]
Calculate disparities between each group and the reference group.
- check_group_empty(sampled_group: array, cat: str, var: str) bool[source]
Check if sampled group is empty.
- check_group_size(group: Index, cat: str, var: str) bool[source]
Check if a group meets the minimum size requirement.
- get_groups_metrics(sliced_dict: dict, metric_list: List[str] | None = None) dict[source]
Calculate metrics for each group based on the task type or a custom list.
- get_metrics(sliced_dict, metric_list: List[str] | None = None) dict[source]
Calculate metrics for each group based on the task type.
- static list_adjustment_methods() Dict[str, str][source]
List available adjustment methods and their descriptions.
- static list_available_tests() Dict[str, str][source]
List available statistical tests and their descriptions.
Overview
The EquiBoots class provides tools for fairness-aware evaluation and bootstrapping of machine learning model predictions. It supports binary, multi-class, multi-label classification, and regression tasks, and enables group-based metric calculation, disparity analysis, and statistical significance testing.
Constructor
- class equiboots.EquiBootsClass.EquiBoots(y_true, y_pred, fairness_df, fairness_vars, y_prob=None, seeds=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], reference_groups=None, task='binary_classification', bootstrap_flag=False, num_bootstraps=10, boot_sample_size=100, balanced=True, stratify_by_outcome=False, group_min_size=10)[source]
Initialize a new EquiBoots instance.
Parameters
y_true (numpy.ndarray) Ground truth labels.
y_pred (numpy.ndarray) Predicted labels.
fairness_df (pandas.DataFrame) DataFrame containing fairness variables.
fairness_vars (list of str) Names of fairness variables.
y_prob (numpy.ndarray, optional) Predicted probabilities.
seeds (list of int, optional) Random seeds for bootstrapping.
reference_groups (list, optional) Reference group for each fairness variable.
task (str) One of binary_classification, multi_class_classification, multi_label_classification, or regression.
bootstrap_flag (bool) Whether to perform bootstrapping.
num_bootstraps (int) Number of bootstrap iterations.
boot_sample_size (int) Size of each bootstrap sample.
balanced (bool) If True, balance samples across groups; otherwise stratify by original proportions.
stratify_by_outcome (bool) Stratify sampling by outcome label.
group_min_size (int) Minimum group size (groups smaller than this are omitted).
Returns
None
Main Methods
- equiboots.EquiBootsClass.grouper(groupings_vars)
Groups data by the specified fairness variables and stores category indices.
Parameters
groupings_vars (list of str) Variables to group by.
Returns
None
- equiboots.EquiBootsClass.slicer(slicing_var)
Slice y_true, y_prob, and y_pred by a single fairness variable.
Parameters
slicing_var (str) Variable name to slice by.
Returns
- dict or list of dict
Sliced outputs.
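As a rough illustration of what slicing by a fairness variable involves, the sketch below splits the three arrays into one sub-dictionary per category. The helper name slice_by_category and the inner keys ("y_true", "y_pred", "y_prob") are assumptions made for this example; they are not taken from the EquiBoots source, and the library's actual output layout may differ.

```python
import numpy as np

def slice_by_category(y_true, y_pred, y_prob, categories):
    """Hypothetical helper: split prediction arrays by category label."""
    categories = np.asarray(categories)
    sliced = {}
    for cat in np.unique(categories):
        # Row positions belonging to this category
        idx = np.flatnonzero(categories == cat)
        sliced[str(cat)] = {
            "y_true": y_true[idx],
            "y_pred": y_pred[idx],
            "y_prob": y_prob[idx],
        }
    return sliced

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.7, 0.4, 0.9])
race = ["A", "B", "A", "B", "A"]

sliced = slice_by_category(y_true, y_pred, y_prob, race)
print(sorted(sliced))             # ['A', 'B']
print(sliced["A"]["y_true"])      # labels for rows 0, 2 and 4
```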
- equiboots.EquiBootsClass.get_metrics(sliced_dict, metric_list=None)
Calculate performance metrics for each group.
Parameters
sliced_dict (dict or list of dict) Output of slicer.
metric_list (list of str, optional) Metrics to calculate.
Returns
- dict or list of dict
Metrics per group.
- equiboots.EquiBootsClass.calculate_disparities(metric_dict, var_name)
Compute ratio disparities against the reference group.
Parameters
metric_dict (dict or list of dict) Group metrics.
var_name (str) Fairness variable name.
Returns
- dict or list of dict
Ratio disparities.
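A ratio disparity divides each group's metric by the reference group's value for the same metric, so the reference group itself always scores 1.0. The following is a minimal sketch of that idea on plain dictionaries. The function name ratio_disparities, the "_ratio" key suffix, and the NaN convention for a zero denominator are all assumptions for illustration, not the library's documented behavior.

```python
def ratio_disparities(metrics, reference_group):
    """Hypothetical helper: metric / reference metric for every group."""
    ref = metrics[reference_group]
    out = {}
    for group, group_metrics in metrics.items():
        out[group] = {}
        for name, value in group_metrics.items():
            ref_value = ref[name]
            # A zero reference value makes the ratio undefined; mark it as NaN
            ratio = value / ref_value if ref_value != 0 else float("nan")
            out[group][f"{name}_ratio"] = ratio
    return out

metrics = {
    "A": {"Recall": 0.80, "Precision": 0.50},
    "B": {"Recall": 0.60, "Precision": 0.75},
}
disparities = ratio_disparities(metrics, reference_group="A")
print(disparities["B"])
```

Values below 1.0 mean the group scores lower than the reference on that metric, and values above 1.0 mean it scores higher.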
- equiboots.EquiBootsClass.calculate_differences(metric_dict, ref_var_name)
Compute difference disparities against the reference group.
Parameters
metric_dict (dict or list of dict) Group metrics.
ref_var_name (str) Reference group name.
Returns
- dict or list of dict
Difference disparities.
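Difference disparities subtract the reference group's metric instead of dividing by it, and the "dict or list of dict" input suggests the same computation is applied either once (point estimates) or once per bootstrap replicate. The sketch below shows one way this could look. The helper names and the "_diff" key suffix are hypothetical and not taken from the library.

```python
def difference_disparities(metrics, reference_group):
    """Hypothetical helper: metric minus reference metric for every group."""
    ref = metrics[reference_group]
    return {
        group: {f"{name}_diff": value - ref[name] for name, value in vals.items()}
        for group, vals in metrics.items()
    }

def difference_disparities_any(metric_dict, reference_group):
    """Accept a single dict (point estimate) or a list of dicts (bootstraps)."""
    if isinstance(metric_dict, list):
        return [difference_disparities(m, reference_group) for m in metric_dict]
    return difference_disparities(metric_dict, reference_group)

# Two bootstrap replicates of per-group accuracy
boots = [
    {"A": {"Accuracy": 0.90}, "B": {"Accuracy": 0.80}},
    {"A": {"Accuracy": 0.85}, "B": {"Accuracy": 0.90}},
]
diffs = difference_disparities_any(boots, reference_group="A")
print(len(diffs), diffs[0]["B"])
```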
- equiboots.EquiBootsClass.analyze_statistical_significance(metric_dict, var_name, test_config, differences=None)
Perform significance testing on metric differences.
Parameters
metric_dict (dict or list of dict) Group metrics.
var_name (str) Fairness variable name.
test_config (dict) Statistical test configuration.
differences (dict, optional) Precomputed differences.
Returns
- dict of dict
Nested results of the form {outer_key: {metric_name: StatTestResult}}, where outer_key is either "omnibus" or a group name, and metric_name is one of the metrics in StatisticalTester.METRIC_LIST.
- equiboots.EquiBootsClass.set_fix_seeds(seeds)
Set fixed random seeds for reproducibility.
Parameters
seeds (list of int) Seeds to apply.
Returns
None
- equiboots.EquiBootsClass.list_available_tests()
List the available statistical tests.
Returns
- dict
Test names and descriptions.
- equiboots.EquiBootsClass.list_adjustment_methods()
List the available p-value adjustment methods.
Returns
- dict
Adjustment methods.
Non-Main/Internal Methods
- equiboots.EquiBootsClass.set_reference_groups(reference_groups)
Set or infer reference groups for fairness variables.
Parameters
reference_groups (list) Reference groups to use.
Returns
None
- equiboots.EquiBootsClass.check_task(task)
Validate the task type.
Parameters
task (str) Task name.
Returns
None
- equiboots.EquiBootsClass.check_classification_task(task)
Ensure the task is a classification type.
Parameters
task (str) Task name.
Returns
None
- equiboots.EquiBootsClass.check_fairness_vars(fairness_vars)
Validate the fairness variables input.
Parameters
fairness_vars (list of str) Variables to validate.
Returns
None
- equiboots.EquiBootsClass.check_group_size(group, cat, var)
Verify minimum size for a group.
Parameters
group (Index) Group data.
cat (str) Category name.
var (str) Variable name.
Returns
- bool
Whether the group meets the minimum size requirement.
- equiboots.EquiBootsClass.check_group_empty(sampled_group, cat, var)
Check if a sampled group is empty.
Parameters
sampled_group (array) Sampled group data.
cat (str) Category name.
var (str) Variable name.
Returns
- bool
Result of the emptiness check.
- equiboots.EquiBootsClass.sample_group(group, n_categories, indx, sample_size, seeds, balanced)
Draw bootstrap or stratified samples.
Parameters
group Group data.
n_categories (int) Number of categories.
indx Indices of data.
sample_size (int) Bootstrap sample size.
seeds (list of int) Random seeds.
balanced (bool) Balance flag.
Returns
The sampled group data.
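The difference between balanced and proportional sampling can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the library's implementation: the function name sample_group_indices, its argument list, and the rounding rule for the proportional case are invented for the example. Balanced sampling draws the same number of rows from every category, while the unbalanced variant keeps each category's share of the original data.

```python
import numpy as np

def sample_group_indices(group_idx, n_categories, n_total, sample_size, seed, balanced):
    """Hypothetical helper: bootstrap row indices for one category."""
    rng = np.random.default_rng(seed)
    if balanced:
        # Equal share of the bootstrap sample for every category
        n_draw = sample_size // n_categories
    else:
        # Share proportional to the category's size in the original data
        n_draw = int(round(sample_size * len(group_idx) / n_total))
    # Sampling with replacement is what makes this a bootstrap draw
    return rng.choice(group_idx, size=n_draw, replace=True)

group_a = np.arange(0, 80)    # 80 rows in category A
group_b = np.arange(80, 100)  # 20 rows in category B

balanced_a = sample_group_indices(group_a, 2, 100, 50, seed=1, balanced=True)
prop_a = sample_group_indices(group_a, 2, 100, 50, seed=1, balanced=False)
prop_b = sample_group_indices(group_b, 2, 100, 50, seed=1, balanced=False)
print(len(balanced_a), len(prop_a), len(prop_b))  # 25 40 10
```

Reusing the same seed reproduces the same draw, which is the purpose of the seeds argument.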
- equiboots.EquiBootsClass.groups_slicer(groups, slicing_var)
Slice data into categories for a given variable.
Parameters
groups Group index mapping.
slicing_var (str) Variable name.
Returns
- dict or list of dict
Sliced data.
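- equiboots.EquiBootsClass.get_groups_metrics(sliced_dict, metric_list=None)
Calculate metrics for each group based on the task type or a custom list.
Parameters
sliced_dict (dict or list of dict) Sliced data.
metric_list (list of str, optional) Custom list of metrics to calculate.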
Returns
- dict or list of dict
Metrics per group.
- equiboots.EquiBootsClass.calculate_groups_disparities(metric_dict, var_name)
Compute ratio disparities for each group.
Parameters
metric_dict (dict or list of dict) Group metrics.
var_name (str) Fairness variable name.
Returns
- dict or list of dict
Ratio disparities.
- equiboots.EquiBootsClass.calculate_groups_differences(metric_dict, ref_var_name)
Compute difference disparities for each group.
Parameters
metric_dict (dict or list of dict) Group metrics.
ref_var_name (str) Reference group name.
Returns
- dict or list of dict
Difference disparities.
Example Usage
Below are two dummy examples demonstrating how to use the EquiBoots class: one without bootstrapping and one with bootstrapping.
For more detailed examples, refer to the Colab notebook or py_scripts/testingscript.py.
Point Estimates Without Bootstrapping
import numpy as np
import pandas as pd
from equiboots import EquiBoots
# Example data
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.7, 0.4, 0.9])
y_pred = np.array([0, 1, 1, 0, 1])
fairness_df = pd.DataFrame({
    "race": ["A", "B", "A", "B", "A"],
    "sex": ["M", "F", "F", "M", "F"],
})
eq = EquiBoots(
    y_true=y_true,
    y_prob=y_prob,
    y_pred=y_pred,
    fairness_df=fairness_df,
    fairness_vars=["race", "sex"],
    task="binary_classification",
    bootstrap_flag=False,
)
eq.grouper(groupings_vars=["race"])
sliced = eq.slicer("race")
metrics = eq.get_metrics(sliced)
disparities = eq.calculate_disparities(metrics, "race")
print("Metrics by group:", metrics)
print("Disparities:", disparities)
With Bootstrapping
import numpy as np
import pandas as pd
from equiboots import EquiBoots
# Example data
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.7, 0.4, 0.9])
y_pred = np.array([0, 1, 1, 0, 1])
fairness_df = pd.DataFrame({
    "race": ["A", "B", "A", "B", "A"],
    "sex": ["M", "F", "F", "M", "F"],
})
eq = EquiBoots(
    y_true=y_true,
    y_prob=y_prob,
    y_pred=y_pred,
    fairness_df=fairness_df,
    fairness_vars=["race", "sex"],
    task="binary_classification",
    bootstrap_flag=True,
    num_bootstraps=5,
    boot_sample_size=5,
)
eq.grouper(groupings_vars=["race"])
sliced = eq.slicer("race")
metrics = eq.get_metrics(sliced)
disparities = eq.calculate_disparities(metrics, "race")
print("Metrics by group (bootstrapped):", metrics)
print("Disparities (bootstrapped):", disparities)
StatisticalTester
Module: equiboots.StatisticalTester
Overview
This module provides statistical significance testing utilities for fairness audits, including bootstrapped tests and per-metric chi-square tests with support for multiple comparison corrections and effect size calculations.
For chi-square testing on binary classification metrics, each metric in METRIC_LIST is tested using a metric-specific two-cell contingency table built by get_contingency_table (for example, Recall uses [TP, FN] and Precision uses [TP, FP]). When expected cell counts are small, Cochran’s rule is applied: if more than 20% of expected cells are below 5, the test falls back to Fisher’s exact test for 2x2 tables, or emits a warning for K x 2 tables. See Kim HY (2017), https://pmc.ncbi.nlm.nih.gov/articles/PMC5426219/.
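The decision described by Cochran's rule can be sketched in plain Python. The code below only decides which test would be used. It does not run the test, and it is not the library's code. The function names and the returned labels are assumptions for illustration. Rows are groups and columns are the two cells of the metric-specific table, and every row and column total is assumed to be non-zero.

```python
def expected_counts(table):
    """Expected cell counts under independence: row_total * col_total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def choose_test(table, threshold=5, max_fraction=0.20):
    """Hypothetical helper: apply Cochran's rule to pick a test."""
    expected = [e for row in expected_counts(table) for e in row]
    small = sum(e < threshold for e in expected)
    violates = small / len(expected) > max_fraction
    if not violates:
        return "chi_square"
    if len(table) == 2 and len(table[0]) == 2:
        return "fisher_exact"           # 2x2 table: exact test is available
    return "chi_square_with_warning"    # K x 2 table: keep chi-square, warn

# Recall cells [TP, FN] for two small groups: expected counts are 9, 3, 9, 3
print(choose_test([[10, 2], [8, 4]]))             # fisher_exact
# Same proportions with ten times the data
print(choose_test([[100, 20], [80, 40]]))         # chi_square
# Three small groups
print(choose_test([[10, 2], [8, 4], [9, 3]]))     # chi_square_with_warning
```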
Results are returned as a nested dictionary of the form {outer_key: {metric_name: StatTestResult}}, where outer_key is either "omnibus" or a non-reference group name.
Classes
StatTestResult
- class equiboots.StatisticalTester.StatTestResult(statistic: float, p_value: float, is_significant: bool, test_name: str, critical_value: float | None = None, effect_size: float | None = None, confidence_interval: Tuple[float, float] | None = None)[source]
Bases: object
Stores statistical test results including test statistic, p-value, and significance.
StatisticalTester
- class equiboots.StatisticalTester.StatisticalTester[source]
Bases: object
Performs statistical significance testing on metrics with support for various tests and data types.
- adjusting_p_vals(config, results)[source]
Runs the p-value adjustment method according to whether the results are bootstrapped.
- analyze_metrics(metrics_data: Dict | List[Dict], reference_group: str, test_config: Dict[str, Any], task: str | None = None, differences: dict | None = None) Dict[str, Dict[str, StatTestResult]][source]
Analyzes metrics for statistical significance against a reference group.
Class Attributes
- equiboots.StatisticalTester.METRIC_LIST
List of binary-classification metrics tested by _chi_square_test. Each metric is tested via a metric-specific 2-cell contingency table built by get_contingency_table:
- Recall ([TP, FN])
- Precision ([TP, FP])
- Accuracy ([TP+TN, FP+FN])
- F1 Score ([TP, FP+FN])
- Specificity ([TN, FP])
- FP Rate ([FP, TN])
- FN Rate ([FN, TP])
- Predicted Prevalence ([TP+FP, TN+FN])
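The metric-to-cells mapping listed above can be written out as a small lookup table. This is a sketch only: the real get_contingency_table takes a pandas DataFrame, whereas the hypothetical contingency_rows helper below works on plain dictionaries of confusion-matrix counts to stay self-contained.

```python
# Two-cell layout per metric, as listed for METRIC_LIST
METRIC_CELLS = {
    "Recall": lambda c: (c["TP"], c["FN"]),
    "Precision": lambda c: (c["TP"], c["FP"]),
    "Accuracy": lambda c: (c["TP"] + c["TN"], c["FP"] + c["FN"]),
    "F1 Score": lambda c: (c["TP"], c["FP"] + c["FN"]),
    "Specificity": lambda c: (c["TN"], c["FP"]),
    "FP Rate": lambda c: (c["FP"], c["TN"]),
    "FN Rate": lambda c: (c["FN"], c["TP"]),
    "Predicted Prevalence": lambda c: (c["TP"] + c["FP"], c["TN"] + c["FN"]),
}

def contingency_rows(counts_by_group, metric):
    """Hypothetical helper: one [cell_1, cell_2] row per group."""
    build = METRIC_CELLS[metric]
    return {group: list(build(c)) for group, c in counts_by_group.items()}

counts = {
    "group1": {"TP": 10, "FP": 5, "TN": 20, "FN": 2},
    "group2": {"TP": 8, "FP": 7, "TN": 18, "FN": 4},
}
print(contingency_rows(counts, "Recall"))    # {'group1': [10, 2], 'group2': [8, 4]}
print(contingency_rows(counts, "Accuracy"))  # {'group1': [30, 7], 'group2': [26, 11]}
```

Stacking the rows for K groups gives the K x 2 table that the chi-square test consumes.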
Function Signatures
- class equiboots.StatisticalTester.StatTestResult(statistic: float, p_value: float, is_significant: bool, test_name: str, critical_value: float | None = None, effect_size: float | None = None, confidence_interval: Tuple[float, float] | None = None)[source]
Stores statistical test results including test statistic, p-value, and significance.
- class equiboots.StatisticalTester.StatisticalTester[source]
Performs statistical significance testing on metrics with support for various tests and data types.
- _bootstrap_test(data: List[float], config: dict) StatTestResult[source]
- _chi_square_test(metrics: Dict[str, Any], config: Dict[str, Any]) Dict[str, StatTestResult][source]
Runs one chi-square test per metric in METRIC_LIST, using a metric-specific 2-cell contingency table. Returns a dictionary keyed by metric name. Falls back to Fisher’s exact test when Cochran’s rule is violated on a 2x2 table.
- get_contingency_table(data: pd.DataFrame, metric: str) pd.DataFrame[source]
Build the metric-specific 2-cell contingency table from a DataFrame of per-group confusion matrix counts. data is expected to have one row per group and columns TP, FP, TN, FN. metric must be one of the entries in METRIC_LIST.
- _calculate_effect_size(metrics: Dict, metric: str) float[source]
Compute Cramer’s V on the same K x 2 contingency slice used by the chi-square test for the given metric.
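For reference, Cramer's V on an r x c table is sqrt(chi2 / (n * (min(r, c) - 1))), which for a K x 2 table reduces to sqrt(chi2 / n). A minimal pure-Python sketch follows. It uses the uncorrected Pearson chi-square statistic and assumes non-zero row and column totals. Whether the library applies a continuity correction is not stated here, so treat the numbers as illustrative.

```python
import math

def chi_square_statistic(table):
    """Pearson chi-square statistic (no continuity correction) and sample size."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n

def cramers_v(table):
    """Cramer's V effect size for an r x c contingency table."""
    stat, n = chi_square_statistic(table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(stat / (n * k))

# Recall cells [TP, FN] for two groups
v = cramers_v([[10, 2], [8, 4]])
print(round(v, 4))  # 0.1925
```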
- _adjust_p_values(results: Dict[str, Dict[str, StatTestResult]], method: str, alpha: float, boot: bool = False) Dict[str, Dict[str, StatTestResult]][source]
Adjust p-values for multiple comparisons. For non-bootstrapped results, adjustment is applied per metric across pairwise (non-omnibus) groups only.
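The per-metric scoping can be illustrated with a Bonferroni correction on a nested dictionary of raw p-values. This sketch is an assumption-laden simplification: it uses bare floats instead of StatTestResult objects, the function name is hypothetical, and it returns only the pairwise entries. The key point is that the number of comparisons is counted per metric across the pairwise groups, with the "omnibus" entry left out of the count.

```python
def bonferroni_per_metric(p_values, alpha=0.05):
    """Hypothetical helper: Bonferroni-adjust pairwise p-values per metric."""
    # Leave the omnibus results out of the family of comparisons
    pairwise = {k: v for k, v in p_values.items() if k != "omnibus"}
    metric_names = {m for res in pairwise.values() for m in res}
    adjusted = {group: {} for group in pairwise}
    for metric in metric_names:
        groups = [g for g in pairwise if metric in pairwise[g]]
        n_comparisons = len(groups)
        for g in groups:
            # Multiply by the number of comparisons and cap at 1
            p_adj = min(1.0, pairwise[g][metric] * n_comparisons)
            adjusted[g][metric] = {"p_value": p_adj, "is_significant": p_adj < alpha}
    return adjusted

raw = {
    "omnibus": {"Recall": 0.01, "Precision": 0.20},
    "B": {"Recall": 0.02, "Precision": 0.30},
    "C": {"Recall": 0.04, "Precision": 0.60},
}
adjusted = bonferroni_per_metric(raw)
print(adjusted["B"]["Recall"])
print(adjusted["C"]["Precision"])
```

With two pairwise groups, each raw p-value is doubled: group B stays significant for Recall, while group C does not.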
- analyze_metrics(metrics_data: Dict | List[Dict], reference_group: str, test_config: Dict[str, Any], task: str | None = None, differences: dict | None = None) Dict[str, Dict[str, StatTestResult]][source]
- cohens_d(data_1, data_2)[source]
Standardized mean difference: (mean(data_1) - mean(data_2)) / pooled_std. Returns 0 when the pooled standard deviation is 0.
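A minimal standard-library sketch of the formula is shown below. The pooled standard deviation here uses the common definition based on sample variances weighted by degrees of freedom. That choice is an assumption, since the exact pooling used by the library is not stated in this reference. Each input needs at least two observations.

```python
import math
from statistics import mean, variance

def cohens_d(data_1, data_2):
    """Standardized mean difference using a pooled sample standard deviation."""
    n1, n2 = len(data_1), len(data_2)
    pooled_var = (
        (n1 - 1) * variance(data_1) + (n2 - 1) * variance(data_2)
    ) / (n1 + n2 - 2)
    pooled_std = math.sqrt(pooled_var)
    if pooled_std == 0:
        # No spread in either sample: report no effect instead of dividing by zero
        return 0.0
    return (mean(data_1) - mean(data_2)) / pooled_std

d = cohens_d([2, 4, 6], [1, 3, 5])
print(d)  # 0.5
print(cohens_d([1, 1, 1], [1, 1, 1]))  # 0.0
```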
- _analyze_single_metrics(metrics: Dict, reference_group: str, config: Dict[str, Any]) Dict[str, Dict[str, StatTestResult]][source]
Three-phase pipeline for non-bootstrapped metrics: (1) run the omnibus chi-square across all groups, (2) for each non-reference group, run a pairwise comparison against the reference (gated on any omnibus significance), (3) annotate effect sizes per (group, metric) where both the omnibus and the pairwise test are significant.
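The control flow of the three phases can be sketched with the statistical tests stubbed out as callables. Everything below is illustrative: the function signature, the plain-dict result records, and the reading of "gated on any omnibus significance" as "run all pairwise tests if at least one omnibus test is significant" are assumptions rather than the library's code.

```python
def analyze_single_metrics(groups, reference, metrics,
                           omnibus_p, pairwise_p, effect_size, alpha=0.05):
    """Hypothetical sketch of the omnibus -> pairwise -> effect-size flow."""
    # Phase 1: one omnibus test per metric across all groups
    results = {"omnibus": {}}
    for metric in metrics:
        p = omnibus_p(metric)
        results["omnibus"][metric] = {"p_value": p, "is_significant": p < alpha}

    # Gate: skip pairwise testing when no omnibus test is significant
    if not any(r["is_significant"] for r in results["omnibus"].values()):
        return results

    # Phase 2: compare every non-reference group against the reference
    for group in groups:
        if group == reference:
            continue
        results[group] = {}
        for metric in metrics:
            p = pairwise_p(group, metric)
            results[group][metric] = {
                "p_value": p, "is_significant": p < alpha, "effect_size": None,
            }

    # Phase 3: effect size only where omnibus and pairwise tests both pass
    for group, per_metric in results.items():
        if group == "omnibus":
            continue
        for metric, res in per_metric.items():
            if results["omnibus"][metric]["is_significant"] and res["is_significant"]:
                res["effect_size"] = effect_size(group, metric)
    return results

# Stubbed p-values standing in for real tests
omni = {"Recall": 0.01, "Precision": 0.40}
pair = {("B", "Recall"): 0.02, ("B", "Precision"): 0.03,
        ("C", "Recall"): 0.20, ("C", "Precision"): 0.50}
res = analyze_single_metrics(
    ["A", "B", "C"], "A", ["Recall", "Precision"],
    omnibus_p=omni.get,
    pairwise_p=lambda g, m: pair[(g, m)],
    effect_size=lambda g, m: 0.3,
)
print(res["B"]["Recall"]["effect_size"], res["B"]["Precision"]["effect_size"])
```

In this run only group B's Recall receives an effect size, because Precision fails the omnibus test and group C fails the pairwise test.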
Usage Example
from equiboots.StatisticalTester import StatisticalTester
tester = StatisticalTester()
config = {
    "test_type": "chi_square",
    "alpha": 0.05,
    "adjust_method": "bonferroni",
}
metrics = {
    "group1": {"TP": 10, "FP": 5, "TN": 20, "FN": 2},
    "group2": {"TP": 8, "FP": 7, "TN": 18, "FN": 4},
}
results = tester.analyze_metrics(
    metrics,
    reference_group="group1",
    test_config=config,
    task="binary_classification",
)
# Results are nested: {outer_key: {metric_name: StatTestResult}}
# outer_key is either "omnibus" or a non-reference group name.
for outer_key, metric_results in results.items():
    for metric, result in metric_results.items():
        print(
            f"{outer_key} / {metric}: "
            f"p-value={result.p_value:.4f}, "
            f"significant={result.is_significant}, "
            f"test={result.test_name}"
        )