Decoding in Order-Agnostic Language Models: Chain-Rule Deviation and Uniform Spreading

2026-05-31 | Computation and Language

AI summary

The authors studied order-agnostic language models (OALMs), which are trained to predict masked tokens from any set of revealed tokens and can therefore generate or score text in any reveal order. In LLaDA-2.1, they found that changing only the reveal order shifts the target log-likelihood by up to 0.49 nats/token, so the score depends on the decoding path as well as the content. They also showed that confidence-first decoding, although free to choose any order, reveals content tokens in a nearly left-to-right order. Finally, the authors introduced a diagnostic, the variance of per-step log-confidence, that measures how evenly the model's confidence is spread along a decoding path; low variance separates structured paths from random ordering. Their results support reporting mean confidence and confidence variance together when comparing decoding paths.

order-agnostic language models, discrete diffusion language models, masked token prediction, log-likelihood, confidence-first decoding, left-to-right decoding, confidence trace, variance, decoding paths, target recoverability
Authors
Lin Yao
Abstract
Order-agnostic language models (OALMs), including discrete diffusion language models (dLLMs), are trained to predict masked tokens under arbitrary conditioning sets, allowing sequences to be generated or scored under arbitrary reveal orders at inference time. Using LLaDA-2.1, we report three findings. First, the learned conditionals are not exact factorizations of a coherent joint distribution: changing only the reveal order shifts target log-likelihood by up to 0.49 nats/token, so likelihood alone mixes content difficulty with path-dependent artifacts. Second, although confidence-first (CF) decoding is order-agnostic, its reveal orders are close to left-to-right (L2R) on content tokens. Third, we propose a complementary diagnostic based on the shape of the confidence trace. A uniform-spreading theorem shows that, at fixed total likelihood, target recoverability is maximized when per-step confidence is spread uniformly; the deviation from this uniform profile motivates $\mathrm{Var}(\log q_t)$ as a diagnostic for comparing decoding paths. Across C4 and four downstream benchmarks, low variance separates structured paths from random ordering, and variance is consistently associated with downstream correctness. These results support reporting mean confidence and confidence variance jointly when comparing OALM decoding paths.
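
As an illustration of the two quantities discussed in the abstract, the following minimal Python sketch (not the author's code) scores a target along a reveal order and summarizes its confidence trace. It assumes that $q_t$ denotes the probability the model assigns to the target token revealed at step $t$, given the tokens revealed before it, and that the variance is the population variance over steps. The names `path_log_likelihood`, `trace_stats`, `exact_cond`, and the toy `JOINT` table are hypothetical and introduced only for this example; no real model is called. With conditionals derived from a coherent joint, the chain rule makes the total log-likelihood independent of the reveal order, which is the property the abstract reports as violated by learned conditionals.

```python
import math
from statistics import fmean, pvariance


def path_log_likelihood(cond_prob, target, order):
    """Sum log q_t along one reveal order.

    cond_prob(pos, token, revealed) is a hypothetical scorer that returns
    the probability of `token` at `pos` given the already revealed positions.
    """
    revealed = {}
    total = 0.0
    for pos in order:
        total += math.log(cond_prob(pos, target[pos], revealed))
        revealed[pos] = target[pos]
    return total


def trace_stats(step_probs):
    """Return (total log-likelihood, mean log q_t, Var(log q_t)) for one path."""
    logs = [math.log(q) for q in step_probs]
    return sum(logs), fmean(logs), pvariance(logs)


# Toy coherent joint over two binary tokens (assumed values, illustration only).
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}


def exact_cond(pos, token, revealed):
    """Exact conditional obtained by marginalizing the toy joint."""
    def mass(constraints):
        return sum(p for seq, p in JOINT.items()
                   if all(seq[i] == v for i, v in constraints.items()))
    return mass({**revealed, pos: token}) / mass(revealed)


target = (1, 1)
ll_forward = path_log_likelihood(exact_cond, target, [0, 1])
ll_reverse = path_log_likelihood(exact_cond, target, [1, 0])
order_gap = ll_forward - ll_reverse  # zero up to rounding for exact conditionals

# Two hand-made confidence traces with the same total likelihood (0.0625).
uniform_total, uniform_mean, uniform_var = trace_stats([0.5, 0.5, 0.5, 0.5])
spiky_total, spiky_mean, spiky_var = trace_stats([1.0, 1.0, 0.25, 0.25])

print(f"order gap: {order_gap:.2e}")
print(f"uniform trace: mean={uniform_mean:.4f} var={uniform_var:.4f}")
print(f"spiky trace:   mean={spiky_mean:.4f} var={spiky_var:.4f}")
```

The two toy traces have identical totals and means, so mean confidence alone cannot tell them apart; only the variance differs, which is why the sketch returns both statistics together. To apply the same functions to a real OALM, `exact_cond` would be replaced by a wrapper around the model's masked-token probabilities, and a nonzero `order_gap` would then measure the path dependence of the score.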