DECK: A Consistency x Confidence Taxonomy of LLM Hallucinations

2026-06-01

Computation and Language
AI summary

The authors propose classifying errors made by large language models (LLMs) by how readily different uncertainty scorers can detect them, rather than by the type of error. Their DECK taxonomy sorts errors into four groups (Drift, Entrenched, Confabulation, Knotted) according to consistency across repeated samples and token-level confidence, and links each group to the scorer families that can catch it. They tested the taxonomy on three models and four datasets and found that externally labelled error types land in the predicted groups. They also identify a common blind spot: when a model confidently and repeatably produces wrong answers because it lacks the relevant knowledge, output-level uncertainty measures fail, and a linear probe on Llama-3-8B's hidden states gives preliminary evidence that the failure may persist at the activation level.

hallucination, large language models, uncertainty quantification, DECK taxonomy, error detectability, consistency scorers, token-level confidence, Youden's J statistic, knowledge gaps, activation-level probes
Authors
Mohit Singh Chauhan
Abstract
Existing hallucination taxonomies classify LLM errors by what is wrong with the output -- memorised misconceptions, reasoning failures, fluent fabrications. These taxonomies are useful for diagnosis but cannot answer a different question: which uncertainty scorer would have caught this error? We propose a complementary taxonomy that classifies errors by their detectability signature -- the signal a scorer family would read. The DECK taxonomy is a 2x2 partition along inter-sample consistency and token-level confidence into four behavioural regimes (Drift, Entrenched, Confabulation, Knotted), each mapping to a specific scorer family (or families) that can detect it: black-box consistency scorers have signal in D and C, white-box token-probability scorers have signal in K and C, and only an LLM-as-a-Judge with independent pretraining can detect E. Cell membership is operationalised by a Youden's J optimal split on each scorer axis. Across three models and four datasets we validate the taxonomy two ways: by analysing scorer-pair disagreement, and by checking that external labels (SelfAware unanswerable, HaluEval adversarial, PopQA entity popularity) land in the predicted DECK cells, with model-scale and content-specific secondary-cell refinements. We further identify a universal blind spot of output-level UQ: on knowledge-gap inputs where the generator emits confident, repeatable fabrications, every output-level family collapses by construction. A linear probe on Llama-3-8B's hidden states also collapses to chance, giving preliminary evidence that the failure may persist at the activation level; richer internal-state methods (UQ heads, information-theoretic estimators) remain to be tested.
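The cell assignment described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes each response carries two scores where higher means "more likely an error" (a hypothetical `inconsistency` score from a black-box consistency scorer and a hypothetical `token_uncertainty` score from a white-box token-probability scorer), and it infers the cell layout from the abstract's statement that consistency scorers have signal in D and C while token-probability scorers have signal in K and C. The function names `youden_threshold` and `deck_cell` are illustrative only.

```python
from typing import Sequence, Tuple


def youden_threshold(scores: Sequence[float],
                     is_error: Sequence[bool]) -> Tuple[float, float]:
    """Return (threshold, J) maximising Youden's J = TPR - FPR.

    A response is flagged as an error when score >= threshold.
    Assumption: higher score means the scorer is more suspicious.
    """
    n_pos = sum(1 for e in is_error if e)
    n_neg = len(is_error) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both correct and incorrect examples")

    best_t, best_j = float("inf"), float("-inf")
    # Every distinct observed score is a candidate cut point.
    for t in sorted(set(scores)):
        tp = sum(1 for s, e in zip(scores, is_error) if e and s >= t)
        fp = sum(1 for s, e in zip(scores, is_error) if not e and s >= t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


def deck_cell(inconsistency: float, token_uncertainty: float,
              t_cons: float, t_tok: float) -> str:
    """Map one erroneous response to a DECK cell (layout inferred, see lead-in)."""
    cons_flag = inconsistency >= t_cons    # consistency scorer has signal
    tok_flag = token_uncertainty >= t_tok  # token-probability scorer has signal
    if cons_flag and tok_flag:
        return "C"  # Confabulation: both families have signal
    if cons_flag:
        return "D"  # Drift: only the consistency family has signal
    if tok_flag:
        return "K"  # Knotted: only the token-probability family has signal
    return "E"      # Entrenched: neither output-level family has signal


# Toy data (made up for illustration, not from the paper).
toy_scores = [0.1, 0.2, 0.8, 0.9]
toy_labels = [False, False, True, True]
threshold, j_stat = youden_threshold(toy_scores, toy_labels)
```

In this sketch the threshold on each axis would be fitted separately with `youden_threshold`, and the two thresholds then passed to `deck_cell`. Responses landing in `"E"` correspond to the regime the abstract says only an LLM-as-a-Judge with independent pretraining can detect.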