Task-Aware Calibration: Provably Optimal Decoding in LLMs

2026-05-11

Machine Learning, Computation and Language
AI summary

The authors explain that large language models (LLMs) often make suboptimal decisions because their predicted outputs do not always match the true generating distribution. They propose a method called task calibration, which aligns the model's predictions within semantically meaningful categories, such as class labels or numbers, instead of raw text. Under this calibration, they show that Minimum Bayes Risk decoding is the optimal strategy for choosing outputs from the calibrated predictions. Their experiments show the approach consistently improves model outputs, and they introduce a new metric, called Task Calibration Error, that measures how well the model is calibrated for the task.

large language models, predictive distribution, task calibration, Minimum Bayes Risk (MBR), latent space, model calibration, decoding, Task Calibration Error (TCE), semantic latent structure
Authors
Tim Tomov, Dominik Fuchsgruber, Rajeev Verma, Stephan Günnemann
Abstract
LLM decoding often relies on the model's predictive distribution to generate an output. Consequently, misalignment with respect to the true generating distribution leads to suboptimal decisions in practice. While a natural solution is to calibrate the model's output distribution, for LLMs, this is ill-posed at the combinatorially vast level of free-form language. We address this by building on the insight that in many tasks, these free-form outputs can be interpreted in a semantically meaningful latent structure, for example, discrete class labels, integers, or sets. We introduce task calibration as a paradigm to calibrate the model's predictive distribution in the task-induced latent space. We apply a decision-theoretic result to show that Minimum Bayes Risk (MBR) decoding on the task-calibrated latent distribution is the optimal decoding strategy on latent model beliefs. Empirically, it consistently improves generation quality across different tasks and baselines. We also introduce Task Calibration Error (TCE), an application-aware calibration metric that quantifies the excess loss due to miscalibration. Our work demonstrates that task calibration enables more reliable model decisions across various tasks and applications.
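To make the pipeline concrete, below is a minimal sketch of the idea described in the abstract, assuming a toy three-class task whose free-form outputs have already been mapped onto latent labels. The distributions, the `temperature_calibrate` map, the 0-1 loss, and the TCE computation (expressed here as excess risk relative to an oracle decision) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical 3-class task: free-form LLM generations have been mapped to
# latent labels {0, 1, 2}. `latent_probs` is the model's (possibly
# miscalibrated) distribution over that latent space.
latent_probs = np.array([0.55, 0.30, 0.15])

def temperature_calibrate(probs, temperature):
    """Toy calibration map: temperature scaling applied in the latent space."""
    logits = np.log(probs) / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def zero_one_loss(prediction, truth):
    """0-1 task loss over latent labels; any task-specific loss could be used."""
    return float(prediction != truth)

def mbr_decode(probs, loss, candidates):
    """Pick the candidate minimizing expected loss (Bayes risk) under `probs`."""
    risks = [sum(p * loss(c, y) for y, p in enumerate(probs)) for c in candidates]
    return candidates[int(np.argmin(risks))], risks

# Calibrate the latent distribution, then decode with MBR on top of it.
calibrated = temperature_calibrate(latent_probs, temperature=1.5)
decision, _ = mbr_decode(calibrated, zero_one_loss, candidates=[0, 1, 2])
print("calibrated latent distribution:", calibrated)
print("MBR decision:", decision)

# TCE-style quantity, sketched as excess loss due to miscalibration: the true
# risk of the decision made under the model's distribution, minus the risk of
# the decision an oracle would make under the true generating distribution.
true_probs = np.array([0.40, 0.45, 0.15])  # hypothetical ground-truth distribution
model_decision, _ = mbr_decode(latent_probs, zero_one_loss, candidates=[0, 1, 2])
true_risk_of_model_decision = sum(
    p * zero_one_loss(model_decision, y) for y, p in enumerate(true_probs)
)
_, oracle_risks = mbr_decode(true_probs, zero_one_loss, candidates=[0, 1, 2])
excess_loss = true_risk_of_model_decision - min(oracle_risks)
print("excess loss due to miscalibration:", excess_loss)
```

The sketch only illustrates the decision-theoretic structure: calibration happens in the task-induced latent space rather than over free-form text, and MBR decoding then acts on the calibrated latent beliefs.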