Does Verbose Chain-of-Thought Really Help? In-Distribution Evidence that Content, Not Length, Matters
2026-06-29 • Artificial Intelligence
Artificial Intelligence • Computation and Language
AI summary
The authors studied why Chain-of-Thought (CoT) prompting helps large language models (LLMs) reason better. They found that just having more words (tokens) doesn't improve accuracy unless those words add meaningful reasoning steps or checks. When two explanations said the same thing but one was wordier, the wordier one helped only a little, and mostly if it was clear and useful, not just long. Overall, their results show that the quality of the reasoning content matters more than the quantity of words, challenging simpler ideas about how extra tokens help.
Chain-of-Thought prompting, Large Language Models, Reasoning, Tokens, Semantic content, Intermediate steps, Validation, Directed acyclic graph, Numerical redaction, Bootstrap confidence intervals
Authors
Wenlong Wang, Fergal Reid
Abstract
Chain-of-thought (CoT) prompting improves LLM reasoning, but the source of the improvement is contested: do the intermediate steps help because they carry useful semantic content, or because conditioning on more tokens buys extra computation before the model commits to an answer? We bring two lines of evidence to bear. First, in distribution: we repeatedly sample each model on the same question and pair a shorter and a longer of its own natural generations that follow the same reasoning plan, so nothing is rewritten and both traces are genuinely in-distribution. Across 25 models, the extra tokens leave accuracy essentially unchanged for every independently trained reasoner, and a blind analysis of the surplus tokens shows that the gains that do appear in other models track validation and checking content, not verbosity per se. Second, as a controlled intervention, we ask whether two traces expressing the same semantic content (the same facts, operations, and intermediate values, verified through directed acyclic graph equivalence) produce different outcomes when one is more verbose. We use a dual-validator design across four targets and eight benchmarks, with number-redacted completion and stratified bootstrap confidence intervals. Verbose traces do improve accuracy (25 of 32 benchmark-target cells are positive under at least one validator), but the effects are modest (typically 1-4 points) and depend on the quality of the verbose prose, not merely its length. Under maximum numerical redaction the effect is amplified (median 3.24x across four arithmetic benchmarks), and length-matched non-reasoning filler recovers none of it. Both lines converge: what matters is what the extra tokens do (the reasoning and validation content they carry), not how many there are, a picture that neither a pure forward-pass-compute account nor a pure semantic-content account fully explains.
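The abstract reports accuracy differences with stratified bootstrap confidence intervals but does not spell out the procedure. As a minimal sketch of what such an interval can look like for a paired concise-versus-verbose comparison, the Python below resamples question-level outcome pairs within each stratum and takes a percentile interval. The function name `stratified_bootstrap_ci`, the choice of strata, the percentile method, and the toy data are all assumptions made for illustration, not the authors' implementation or results.

```python
import random
from statistics import mean


def stratified_bootstrap_ci(strata, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a paired accuracy difference.

    strata: dict mapping a stratum name to a list of
            (concise_correct, verbose_correct) pairs, each value 0 or 1.
    Returns (point, lower, upper) for verbose minus concise accuracy,
    in percentage points. Hypothetical helper, not from the paper.
    """
    rng = random.Random(seed)

    def diff_in_points(pairs):
        # Mean of per-question (verbose - concise) outcomes, scaled to points.
        return 100.0 * mean(v - c for c, v in pairs)

    all_pairs = [p for pairs in strata.values() for p in pairs]
    point = diff_in_points(all_pairs)

    estimates = []
    for _ in range(n_boot):
        resampled = []
        for pairs in strata.values():
            # Resample with replacement inside each stratum so that every
            # stratum keeps its original size in every replicate.
            resampled.extend(rng.choices(pairs, k=len(pairs)))
        estimates.append(diff_in_points(resampled))

    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper


# Toy, made-up outcomes: (concise_correct, verbose_correct) per question.
toy_strata = {
    "easy": [(1, 1)] * 40 + [(0, 1)] * 4,
    "hard": [(0, 0)] * 30 + [(0, 1)] * 3 + [(1, 0)] * 1,
}
point, lower, upper = stratified_bootstrap_ci(toy_strata)
print(f"verbose - concise: {point:.2f} points, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Resampling within strata keeps each stratum's share of the sample fixed across replicates, which is the usual reason to prefer a stratified bootstrap when strata differ in difficulty or size. Pairing the two outcomes per question means that question-level difficulty cancels out of the difference.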