Tail-Shape Estimation in LLM Evaluation Is Fragile: A Protocol for Diagnosing False Positives

2026-06-15
Machine Learning

AI summary

The authors ask whether the tail-index, an extreme-value-theory parameter that describes how heavy the extremes of a large language model's error distribution are, adds useful information beyond average error and a standard measure of extreme error size. They pre-registered a testing protocol with admissibility, goodness-of-fit, threshold-stability, and effect-size checks that any positive tail-shape claim must pass. Applied to a standard toxicity-evaluation setup under two structurally different scorer families, the protocol caught three distinct kinds of false positives that a naive analysis would have reported, and it rejected the headline tail-shape claim on both scorers. The authors conclude that tail-index estimates in the setups they examined are more fragile than recent literature suggests, and they offer the protocol as a starting point for similar tail-index claims.

large language models, evaluation metrics, tail-index, extreme value theory, conditional value-at-risk, reward-model error, toxicity evaluation, goodness-of-fit, statistical significance, threshold stability
Authors
Luca Zhou
Abstract
Recent work motivates moving large language model (LLM) evaluation from mean-based to tail-aware metrics, including conditional value-at-risk and tail-index estimates of reward-model error. We ask whether the canonical extreme-value-theory tail-index parameter, which isolates how heavy a tail is from how large the tail mass is, adds discriminative information beyond the mean and a standard tail-magnitude statistic in LLM evaluation. We pre-register a protocol covering admissibility, goodness-of-fit, threshold-stability, and effect-size requirements for any positive tail-shape claim. The protocol is the contribution of this paper; the empirical study below is a demonstration of what its gates catch. Applied to a standard LLM toxicity-evaluation setup under two structurally different scorer families, the protocol catches three distinct modes of false positives that a naive analysis would have published, and rejects the headline tail-shape claim on both scorers. We conclude that tail-shape estimation in the LLM toxicity-evaluation setups we examined is more fragile than the recent literature suggests, and recommend the protocol as a starting point for tail-index claims in similar setups.
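
To make the quantities named in the abstract concrete, the following is a minimal Python sketch of an empirical conditional value-at-risk (a tail-magnitude statistic) and a tail-index estimate computed at several thresholds. It is illustrative only and not the authors' implementation: the abstract does not say which estimator the paper uses, so the Hill estimator here is an assumption, and the function names `empirical_cvar`, `hill_tail_index`, and `hill_stability` are hypothetical. The sketch also covers only the threshold-sweep idea behind a stability check, not the protocol's admissibility, goodness-of-fit, or effect-size gates.

```python
import numpy as np


def empirical_cvar(errors, alpha=0.95):
    """Mean of the errors at or above the empirical alpha-quantile."""
    x = np.asarray(errors, dtype=float)
    q = np.quantile(x, alpha)
    return float(x[x >= q].mean())


def hill_tail_index(errors, k):
    """Hill estimate of the tail index xi from the k largest values.

    Assumption: the Hill estimator stands in for whatever estimator the
    paper uses. It requires a positive (k+1)-th largest value.
    """
    x = np.sort(np.asarray(errors, dtype=float))[::-1]  # descending order
    if not 0 < k < len(x):
        raise ValueError("k must satisfy 0 < k < len(errors)")
    if x[k] <= 0:
        raise ValueError("Hill estimator needs positive order statistics")
    # Average log-excess of the top k values over the (k+1)-th largest.
    return float(np.mean(np.log(x[:k]) - np.log(x[k])))


def hill_stability(errors, ks):
    """Hill estimates over a sweep of thresholds k (one estimate per k)."""
    return np.array([hill_tail_index(errors, k) for k in ks])
```

On a synthetic Pareto sample the estimates should stay roughly flat across `k`; large swings across thresholds are the kind of instability a threshold-stability requirement is meant to flag. For example, `hill_stability(sample, [500, 1000, 2000])` returns one estimate per threshold, which can then be compared against a tolerance chosen in advance.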