The Role of Ambiguity in Error Prediction via Uncertainty Quantification

2026-06-01 • Computation and Language

Computation and Language • Artificial Intelligence • Machine Learning

AI summary

The authors studied how to better predict when large language models (LLMs) will make mistakes. They found that existing uncertainty measures conflate ambiguity that is inherent in the input with the model's own uncertainty, and that these measures predict errors better on unambiguous questions than on questions with several plausible answers. By feeding gold or predicted ambiguity labels into the error predictor through Gated Experts and Selective Prediction, they improved error prediction for question-answering tasks. The gains held across model families, datasets, and sources of input uncertainty.

Error Prediction • Uncertainty Quantification • Aleatoric Uncertainty • Large Language Models • Question Answering • Gated Experts • Selective Prediction • Input Ambiguity • PRR (Prediction Rejection Ratio)

Authors

Ieva Raminta Staliūnaitė, James Bishop, Andreas Vlachos

Abstract

The task of Error Prediction, namely predicting whether a model output is correct, is commonly tackled with Uncertainty Quantification (UQ). However, while uncertainty metrics capture when models lack the knowledge or capacity to make a prediction, they also reflect aleatoric uncertainty, which is inherent in the model input and context. This paper presents a method for improving error prediction for Large Language Models (LLMs) by disentangling input ambiguity from the UQ signal. We conduct experiments on the task of Question Answering (QA) with six UQ metrics and show that UQ metrics are more predictive of errors on unambiguous instances than on questions with multiple plausible answers. We use Gated Experts and Selective Prediction to incorporate gold and predicted ambiguity labels into the error prediction pipeline. We find that ambiguity information improves error prediction scores across model families, training and evaluation paradigms, datasets (including allegedly unambiguous ones), and sources of aleatoric uncertainty, yielding improvements of over 10 points of PRR for individual UQ metrics on standard datasets.
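
The gating idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the data, the threshold-based "experts", and the names `fit_threshold`, `fit_gated_experts`, and `predict_error` are all hypothetical, chosen only to show why routing on an ambiguity label might help when the same uncertainty score means different things for ambiguous and unambiguous questions.

```python
from typing import Dict, List, Tuple

# Hypothetical record: (uncertainty score, is_ambiguous, is_error).
Example = Tuple[float, bool, bool]


def fit_threshold(scores: List[float], errors: List[bool]) -> float:
    """Pick the cut-off t that maximises accuracy of the rule 'score >= t means error'."""
    candidates = sorted(set(scores)) + [float("inf")]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        acc = sum((s >= t) == e for s, e in zip(scores, errors)) / len(scores)
        if acc > best_acc:  # strict: ties keep the lowest threshold
            best_t, best_acc = t, acc
    return best_t


def fit_gated_experts(data: List[Example]) -> Dict[bool, float]:
    """Fit one threshold 'expert' per ambiguity group (an assumed, simplified gate)."""
    experts: Dict[bool, float] = {}
    for flag in (False, True):
        group = [(s, e) for s, a, e in data if a == flag]
        if group:
            scores, errors = zip(*group)
            experts[flag] = fit_threshold(list(scores), list(errors))
    return experts


def predict_error(experts: Dict[bool, float], score: float, is_ambiguous: bool) -> bool:
    """Route the instance to the expert that matches its ambiguity label."""
    return score >= experts[is_ambiguous]


# Toy data: ambiguous questions carry higher uncertainty even when answered correctly.
data: List[Example] = [
    (0.10, False, False), (0.20, False, False), (0.60, False, True), (0.70, False, True),
    (0.60, True, False), (0.70, True, False), (0.90, True, True), (0.95, True, True),
]

experts = fit_gated_experts(data)            # {False: 0.6, True: 0.9}
global_t = fit_threshold([s for s, _, _ in data], [e for _, _, e in data])  # 0.6


def accuracy(predict) -> float:
    return sum(predict(s, a) == e for s, a, e in data) / len(data)


gated_acc = accuracy(lambda s, a: predict_error(experts, s, a))  # 1.0
global_acc = accuracy(lambda s, a: s >= global_t)                # 0.75
```

On this toy data a single threshold misclassifies the correct but ambiguous instances, while the gated version separates them. The paper's experts and its PRR metric are more involved than this accuracy comparison.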
