Calibration Curves and Area Under the Curve

Understanding the mathematical intuition behind calibration curves and related metrics helps clarify their diagnostic value in evaluating model reliability. This section outlines foundational concepts using simplified examples, progressing toward their real-world interpretation in model evaluation.

Calibration Curves and Area Interpretation

Calibration curves visualize how well predicted probabilities align with actual outcomes. A perfectly calibrated model lies along the diagonal line, where predicted probability equals observed frequency.
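
As a minimal sketch of how the points on a calibration curve can be obtained, the snippet below groups predicted probabilities into equal-width bins and compares the mean prediction in each bin with the observed positive rate. The function name `calibration_curve_points`, the equal-width binning, and the toy data are assumptions made for illustration; they are not taken from EquiBoots.

```python
def calibration_curve_points(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed frequency) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        # Equal-width bins on [0, 1]; a probability of exactly 1.0 goes in the last bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        bins[idx].append((prob, label))

    points = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            frac_pos = sum(lab for _, lab in members) / len(members)
            points.append((mean_pred, frac_pos))
    return points


# Hypothetical toy data: eight predictions and their binary outcomes.
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y_true = [0, 0, 1, 0, 1, 0, 1, 1]

points = calibration_curve_points(y_true, y_prob, n_bins=2)
for mean_pred, frac_pos in points:
    print(f"mean predicted = {mean_pred:.2f}, observed frequency = {frac_pos:.2f}")
```

With this toy data both bins land on the diagonal (roughly 0.25 and 0.75 on each axis), which is what perfect calibration looks like at this resolution.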

Below are two manual examples using toy functions to illustrate the concept of area under the calibration curve, a key component of metrics like Calibration AUC.

Example 1: Calibration with y = x²

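Under the usual plotting convention (predicted probability on the x-axis, observed frequency on the y-axis), this curve lies below the diagonal on \((0, 1)\): observed frequencies are lower than the predicted probabilities, so the model consistently overestimates risk.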

To compute the calibration area under this curve from \(x = 0\) to \(x = 1\):

\[\text{Area} = \int_0^1 x^2 \, dx\]

Solution:

\[\left[ \frac{x^3}{3} \right]_0^1 = \frac{1}{3}\]

The area under the ideal calibration line (diagonal) is:

\[\int_0^1 x \, dx = \left[ \frac{x^2}{2} \right]_0^1 = \frac{1}{2}\]

So, the polygonal calibration AUC becomes:

\[\frac{1}{2} - \frac{1}{3} = \frac{1}{6}\]

Figure: Toy calibration polygon example, \(y = x^2\)
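
The result of Example 1 can be checked numerically. The sketch below approximates both integrals with a hand-written trapezoid rule and takes their difference; the helper name `trapezoid_area` and the number of subintervals are illustrative choices, not part of EquiBoots.

```python
def trapezoid_area(f, a=0.0, b=1.0, n=10_000):
    """Approximate the integral of f over [a, b] with the composite trapezoid rule."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return total * h


area_curve = trapezoid_area(lambda x: x ** 2)   # close to 1/3
area_diagonal = trapezoid_area(lambda x: x)     # close to 1/2
polygon_area = area_diagonal - area_curve       # close to 1/6

print(f"curve: {area_curve:.6f}, diagonal: {area_diagonal:.6f}, "
      f"difference: {polygon_area:.6f}")
```

The numerical difference agrees with the analytic value of \(\frac{1}{6}\) to well within the discretization error of the rule.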

Example 2: Calibration with y = x² + 4x

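Under the usual plotting convention (predicted probability on the x-axis, observed frequency on the y-axis), this curve lies above the diagonal on \((0, 1]\): observed frequencies exceed the predicted probabilities, so the model consistently underestimates risk.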

To calculate the area under the curve from \(x = 0\) to \(x = 1\), we compute the definite integral:

\[\text{Area} = \int_0^1 (x^2 + 4x) \, dx\]

Solution:

We split the integral into two separate parts:

\[\int_0^1 (x^2 + 4x) \, dx = \int_0^1 x^2 \, dx + \int_0^1 4x \, dx\]

First Integral:

\[\int_0^1 x^2 \, dx = \left[ \frac{x^3}{3} \right]_0^1 = \frac{1^3}{3} - \frac{0^3}{3} = \frac{1}{3}\]

Second Integral:

\[\int_0^1 4x \, dx = 4 \int_0^1 x \, dx = 4 \left[ \frac{x^2}{2} \right]_0^1 = 4 \left( \frac{1^2}{2} - \frac{0^2}{2} \right) = 4 \cdot \frac{1}{2} = 2\]

Final Answer:

\[\int_0^1 (x^2 + 4x) \, dx = \frac{1}{3} + 2 = \frac{7}{3}\]

This result represents the total area under the curve \(y = x^2 + 4x\) over the interval \([0, 1]\). To compare against the ideal calibration line \(y = x\), subtract the diagonal area \(\frac{1}{2}\) to isolate the calibration polygon AUC, which gives \(\frac{7}{3} - \frac{1}{2} = \frac{11}{6}\).

Note

In real calibration plots, the area is bounded within [0,1] on both axes. This example is meant to illustrate the mechanics of integration over a custom curve.

Figure: Toy calibration polygon example, \(y = x^2 + 4x\)
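
Because both curves in Example 2 are polynomials, the integrals can also be reproduced exactly with rational arithmetic. The following sketch uses Python's standard `fractions` module; the helper `integrate_polynomial` and its coefficient-list convention (lowest degree first) are assumptions made for this illustration.

```python
from fractions import Fraction


def integrate_polynomial(coeffs, a, b):
    """Exact integral over [a, b] of sum(coeffs[k] * x**k), lowest degree first."""
    def antiderivative(x):
        x = Fraction(x)
        return sum(Fraction(c, k + 1) * x ** (k + 1) for k, c in enumerate(coeffs))
    return antiderivative(b) - antiderivative(a)


curve_area = integrate_polynomial([0, 4, 1], 0, 1)   # 4x + x^2  -> 7/3
diagonal_area = integrate_polynomial([0, 1], 0, 1)   # x         -> 1/2
difference = curve_area - diagonal_area              # 11/6

print(curve_area, diagonal_area, difference)
```

Exact fractions avoid any rounding, so the output can be compared directly with the hand calculation.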

Regression Residuals

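For a regression model, the residual of observation \(i\) is the difference between the observed value \(y_i\) and the predicted value \(\hat{y}_i\):

\[\text{residual}_i = y_i - \hat{y}_i\]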

These residuals are used to compute various point estimate metrics that summarize model performance on a given dataset. Common examples include:

  • Mean Absolute Error (MAE):

    \[\text{MAE} = \frac{1}{n} \sum_{i=1}^n \left| y_i - \hat{y}_i \right|\]
  • Mean Squared Error (MSE):

    \[\text{MSE} = \frac{1}{n} \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2\]
  • Root Mean Squared Error (RMSE):

    \[\text{RMSE} = \sqrt{\text{MSE}}\]
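
The three formulas above translate directly into code. The sketch below computes them from a small set of hypothetical observed and predicted values using only the standard library.

```python
import math

# Hypothetical observed and predicted values.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# residual_i = y_i - y_hat_i  ->  [0.5, 0.0, -1.5, -1.0]
residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]
n = len(residuals)

mae = sum(abs(r) for r in residuals) / n    # 0.75
mse = sum(r ** 2 for r in residuals) / n    # 0.875
rmse = math.sqrt(mse)                       # about 0.9354

print(f"MAE = {mae:.4f}, MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```

MSE and RMSE weight the two larger residuals more heavily than MAE does, which is why RMSE exceeds MAE here.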

These are considered point estimates because they provide single-value summaries of the model’s residual error without incorporating uncertainty or sampling variability. To assess the stability or confidence of these estimates, techniques such as bootstrapping can be used to generate distributions over repeated samples.
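
To make the bootstrapping idea concrete, the sketch below resamples observation pairs with replacement and recomputes RMSE on each resample, yielding a distribution from which a percentile interval can be read. The function names, the data, the number of resamples, and the simple percentile indexing are all illustrative assumptions; this is not EquiBoots' bootstrap implementation.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error of paired observations and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def bootstrap_rmse(y_true, y_pred, n_boot=1000, seed=0):
    """Return a sorted list of RMSE values computed on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    estimates = []
    for _ in range(n_boot):
        # Resample indices with replacement so pairs stay aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(rmse([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    return sorted(estimates)


# Hypothetical data.
y_true = [3.0, 5.0, 2.5, 7.0, 4.0, 6.5, 1.0, 8.0]
y_pred = [2.5, 5.0, 4.0, 8.0, 3.5, 6.0, 2.0, 7.0]

point_estimate = rmse(y_true, y_pred)
boot = bootstrap_rmse(y_true, y_pred)

# Approximate 95% percentile interval from the sorted bootstrap estimates.
lower = boot[int(0.025 * len(boot))]
upper = boot[int(0.975 * len(boot)) - 1]

print(f"RMSE = {point_estimate:.3f}, 95% interval ~ [{lower:.3f}, {upper:.3f}]")
```

The width of the interval, rather than the point estimate alone, indicates how stable the metric is on a dataset of this size.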

Chi-Square Tests and Cochran’s Rule

The chi-square test of independence relies on a large-sample approximation. Its sampling distribution approaches the theoretical chi-square distribution only when expected cell counts are sufficiently large. When expected counts are small, the approximation breaks down and p-values become unreliable.

Chi-Square Statistic

For a contingency table with observed counts \(O_{ij}\) and expected counts \(E_{ij}\), the chi-square statistic is:

\[\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

The expected count under the null hypothesis of independence is computed from the row and column marginals:

\[E_{ij} = \frac{R_i \cdot C_j}{N}\]

where \(R_i\) is the total of row \(i\), \(C_j\) is the total of column \(j\), and \(N\) is the grand total.
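
The two formulas can be sketched in a few lines of plain Python. The helper names `expected_counts` and `chi_square_statistic` and the example table are hypothetical and chosen only for illustration.

```python
def expected_counts(observed):
    """E_ij = R_i * C_j / N for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Sum of (O_ij - E_ij)^2 / E_ij over all cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )


# Hypothetical 2 x 2 table: R = (30, 70), C = (40, 60), N = 100.
observed = [[10, 20], [30, 40]]

expected = expected_counts(observed)       # [[12.0, 18.0], [28.0, 42.0]]
statistic = chi_square_statistic(observed)

print(expected)
print(f"chi-square = {statistic:.4f}")
```

Every expected count in this table is well above 5, so the large-sample approximation is not in question here.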

Cochran’s Rule

Cochran (1954) provides a practical validity criterion: if more than 20% of expected cell counts fall below 5, the chi-square approximation should not be trusted. For a contingency table with \(K \times J\) cells, the rule is violated when:

\[\frac{\#\{(i,j) : E_{ij} < 5\}}{K \cdot J} > 0.20\]

When this happens, alternative tests are recommended:

  • For 2 x 2 tables: Fisher’s exact test

  • For larger tables: Fisher-Freeman-Halton exact test, or a chi-square test with a Monte Carlo simulated p-value
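
A minimal sketch of the rule as a decision helper follows. The function names, the returned strings, and the example tables of expected counts are assumptions for illustration and do not reflect how EquiBoots structures this check internally.

```python
def cochran_rule_violated(expected, min_count=5.0, max_fraction=0.20):
    """Return (violated, fraction of cells with expected count below min_count)."""
    cells = [e for row in expected for e in row]
    fraction = sum(e < min_count for e in cells) / len(cells)
    return fraction > max_fraction, fraction


def suggest_test(expected):
    """Map the rule's outcome to the alternatives listed above."""
    violated, _ = cochran_rule_violated(expected)
    if not violated:
        return "chi-square test"
    is_2x2 = len(expected) == 2 and all(len(row) == 2 for row in expected)
    if is_2x2:
        return "Fisher's exact test"
    return "Fisher-Freeman-Halton exact test or Monte Carlo chi-square"


# Hypothetical tables of expected counts.
dense_2x2 = [[12.0, 18.0], [28.0, 42.0]]            # no cell below 5
sparse_2x2 = [[2.0, 1.0], [3.0, 6.0]]               # 3 of 4 cells below 5
borderline_3x2 = [[6.0, 6.0], [6.0, 6.0], [4.0, 8.0]]  # 1 of 6 cells below 5

for table in (dense_2x2, sparse_2x2, borderline_3x2):
    print(cochran_rule_violated(table), suggest_test(table))
```

The third table shows that a single small cell does not trigger the rule on its own: one cell out of six is about 17%, which stays under the 20% threshold.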

Worked Example: Sparse K x 2 Table

Consider a K x 2 contingency table (here \(K = 3\)) for the Recall metric across three groups, populated with small counts:

\[\begin{split}\begin{array}{|c|c|c|} \hline \text{Group} & \text{TP} & \text{FN} \\ \hline \text{ref} & 2 & 1 \\ \text{A} & 1 & 1 \\ \text{B} & 1 & 0 \\ \hline \end{array}\end{split}\]

Row totals: \(R = (3, 2, 1)\). Column totals: \(C = (4, 2)\). Grand total: \(N = 6\).

Compute the expected count for each cell:

\[E_{\text{ref}, \text{TP}} = \frac{3 \cdot 4}{6} = 2.00\]
\[E_{\text{ref}, \text{FN}} = \frac{3 \cdot 2}{6} = 1.00\]
\[E_{A, \text{TP}} = \frac{2 \cdot 4}{6} \approx 1.33\]
\[E_{A, \text{FN}} = \frac{2 \cdot 2}{6} \approx 0.67\]
\[E_{B, \text{TP}} = \frac{1 \cdot 4}{6} \approx 0.67\]
\[E_{B, \text{FN}} = \frac{1 \cdot 2}{6} \approx 0.33\]

All six expected cells fall below 5, so the violation fraction is:

\[\frac{6}{6} = 1.00 > 0.20\]

Cochran’s rule is violated. The chi-square approximation is unreliable on this table, and a more appropriate test should be substituted.
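
One of the substitutes mentioned earlier, a chi-square test with a Monte Carlo simulated p-value, can be sketched for this table in plain Python. The approach below shuffles outcome labels across individuals so that row and column totals stay fixed, then counts how often the simulated statistic is at least as large as the observed one. The function names, the number of simulations, and the seed are illustrative assumptions, and this is not what EquiBoots runs internally.

```python
import random


def chi_square_statistic(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return sum(
        (table[i][j] - r * c / total) ** 2 / (r * c / total)
        for i, r in enumerate(row_totals)
        for j, c in enumerate(col_totals)
    )


def monte_carlo_p_value(table, n_sim=2000, seed=0):
    """Simulated p-value obtained by permuting outcomes with margins held fixed."""
    rng = random.Random(seed)
    # Expand the table into one (group, outcome) pair per individual.
    groups, outcomes = [], []
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            groups.extend([i] * count)
            outcomes.extend([j] * count)

    observed_stat = chi_square_statistic(table)
    n_rows, n_cols = len(table), len(table[0])
    extreme = 0
    for _ in range(n_sim):
        rng.shuffle(outcomes)
        simulated = [[0] * n_cols for _ in range(n_rows)]
        for g, o in zip(groups, outcomes):
            simulated[g][o] += 1
        # Small tolerance guards against floating-point ties.
        if chi_square_statistic(simulated) >= observed_stat - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed_stat, (extreme + 1) / (n_sim + 1)


# Rows: ref, A, B. Columns: TP, FN.
table = [[2, 1], [1, 1], [1, 0]]
observed_stat, p_value = monte_carlo_p_value(table)
print(f"chi-square = {observed_stat:.2f}, simulated p-value = {p_value:.3f}")
```

The observed statistic for this table is 0.75. With only six observations the simulated p-value carries little evidential weight either way, which is the practical message of the worked example.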

Note

In EquiBoots, this check is built into _chi_square_test. When the rule is violated on a 2 x 2 table, the implementation transparently swaps in Fisher’s exact test. On larger K x 2 tables, a warning is emitted recommending Fisher’s exact test as a follow-up.

Reference

Kim HY (2017). Statistical notes for clinical researchers: Chi-squared test and Fisher’s exact test. Restorative Dentistry & Endodontics, 42(2), 152-155. https://pmc.ncbi.nlm.nih.gov/articles/PMC5426219/