The Origins of Stochasticity: Comprehensive Investigations on Uncertainty Quantification for Large Language Models
2026-06-22 • Artificial Intelligence
AI summary
The authors looked at why large language models (LLMs) like Qwen3 and Llama sometimes give uncertain answers and tried to understand where that uncertainty comes from in more detail. They created a new system to break down uncertainty into parts like the input, model parameters, each word generated, and the decoding process. They tested 21 different ways to measure uncertainty and found that some methods, especially consensus-based ones, worked better depending on the type of task. They also noticed that bigger models tend to be less uncertain. This helps people better measure and trust LLM responses in different situations.
Large Language Models (LLMs), Uncertainty Quantification (UQ), Aleatoric uncertainty, Epistemic uncertainty, Bayesian methods, Ensemble methods, Consensus-based methods, Decoding process, Model scaling, TriviaQA
Authors
Xiang-Jun Ou, Shuang Liang, Xin-Yu Hu, Rong-Hao Huang, Jing Wang, Shao-Qun Zhang
Abstract
Recent advancements in Large Language Models (LLMs) have enabled sophisticated reasoning and content generation, yet their inherent stochasticity poses significant challenges for ensuring predictive credibility. While traditional uncertainty taxonomies, such as the dichotomy of aleatoric and epistemic uncertainty, provide conceptual foundations, they often fail to capture the multi-component and multi-stage nature of LLM generation and offer little basis for evaluating the effectiveness of various Uncertainty Quantification (UQ) methods. In this paper, we propose a granular uncertainty taxonomy that systematically attributes LLM uncertainty to input-level, parameter-level, token-level, and decoding-process sources. Correspondingly, we categorize existing UQ methods into Bayesian, ensemble, consensus-based, and single-pass approaches. Furthermore, we introduce a comprehensive evaluation framework covering diverse generation settings and metrics. We empirically evaluate 21 typical UQ methods across three prominent LLM families, Qwen3, Llama 3.2, and DeepSeek-V3, on benchmarks such as TriviaQA, GSM8K, and HumanEval. Our experimental results demonstrate that (i) the effectiveness of UQ methods is sensitive to task types and generation settings; (ii) consensus-based methods, such as Deg and EigV, consistently outperform other UQ approaches; and (iii) larger model scales correlate with lower uncertainty estimates, suggesting an empirical scaling law for LLM uncertainty. This work bridges the gap between theoretical origins and practical deployment, providing a versatile diagnostic tool for systematically quantifying uncertainty in LLM applications.
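The abstract names Deg and EigV as consensus-based methods but does not define them. The sketch below illustrates one plausible reading, assuming the graph-based formulations common in prior black-box UQ work: sample several responses to the same prompt, build a pairwise similarity matrix, then score disagreement from the degree matrix (Deg) or from the eigenvalues of the normalized graph Laplacian (EigV). The function names are hypothetical, and Jaccard token overlap is only a stand-in for whatever similarity measure the paper actually uses.

```python
import numpy as np


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; an assumed stand-in similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def similarity_matrix(responses: list[str]) -> np.ndarray:
    """Symmetric m x m matrix of pairwise similarities with ones on the diagonal."""
    m = len(responses)
    W = np.ones((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            W[i, j] = W[j, i] = jaccard(responses[i], responses[j])
    return W


def deg_uncertainty(W: np.ndarray) -> float:
    """Assumed Deg score: trace(m*I - D) / m^2, where D is the degree matrix.

    Returns 0 when all responses agree and (m - 1) / m when none overlap.
    """
    m = W.shape[0]
    degrees = W.sum(axis=1)
    return float(np.sum(m - degrees) / m**2)


def eigv_uncertainty(W: np.ndarray) -> float:
    """Assumed EigV score: sum of max(0, 1 - lambda_k) over the eigenvalues
    of the symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.

    Roughly counts the number of distinct "meanings" among the responses.
    """
    m = W.shape[0]
    degrees = W.sum(axis=1)  # at least 1 because the diagonal of W is 1
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    L = np.eye(m) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(L)  # L is symmetric, so eigvalsh applies
    return float(np.sum(np.maximum(0.0, 1.0 - eigvals)))


if __name__ == "__main__":
    samples = ["Paris", "Paris", "the city of Paris", "Lyon"]
    W = similarity_matrix(samples)
    print("Deg:", deg_uncertainty(W))
    print("EigV:", eigv_uncertainty(W))
```

Under these assumed definitions, both scores rise as the sampled responses diverge, which is the intuition behind consensus-based UQ: agreement across samples is treated as evidence of confidence. Neither score needs access to model logits, so the same sketch applies to any model that can be sampled repeatedly.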