Capacity, Not Format: Rethinking Structured Reasoning Failures

2026-06-08 | Artificial Intelligence

Artificial Intelligence, Computation and Language
AI summary

The authors show that requiring structured output, such as JSON, hurts models that are already operating close to their capacity limits. Large models with spare capacity handle these formats without losing accuracy. Smaller or near-capacity models perform worse, either because their answers are truncated or because formatting competes with reasoning for the same limited capacity. The authors suggest that when a model is near its limits, it is better to let it reason freely first and add structure afterwards to preserve accuracy.

structured output, model capacity, JSON format, token budget, prompt length, schema complexity, GPT-4o-mini, truncation, capacity competition, reasoning tax
Authors
Hengxin Fan
Abstract
Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks with 0% parse failures on successfully generated responses. We find that structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: $88.7\pm4.0$% JSON vs. $89.3\pm1.7$% CoT on MATH-Hard). In contrast, formats severely degrade models operating near their limits through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp ($p < 0.0001$) largely due to truncation. Second, even with extended budgets eliminating truncation, GPT-4o-mini drops 28.0pp ($p < 0.001$), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar $p < 0.0001$) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON ($-5.3$pp; the displayed percentages are independently rounded, exact difference is $7/133 = 5.26$pp $\approx 5.3$pp). A delayed-structure ablation -- reasoning freely before formatting -- recovers most of the lost accuracy (3-run mean: 80--87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.
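The "think first, format later" recommendation can be illustrated with a two-pass pipeline. The following is a minimal sketch, not the authors' implementation: the abstract does not give the prompts or schema used in the delayed-structure ablation, so the prompt wording, the two-key schema, and the names `delayed_structure`, `generate`, and `fake_generate` are all hypothetical. `generate` stands in for any text-in, text-out model call.

```python
import json
from typing import Callable


def delayed_structure(question: str, generate: Callable[[str], str]) -> dict:
    """Reason in free-form prose first, then format the result as JSON.

    `generate` is a hypothetical callable wrapping a model call; the prompts
    and the two-key schema below are illustrative assumptions.
    """
    # Pass 1: unconstrained reasoning, with no formatting requirement.
    reasoning = generate(
        "Solve the problem step by step in plain prose.\n\n"
        f"Problem: {question}"
    )
    # Pass 2: formatting only; the reasoning is already done.
    formatted = generate(
        "Convert the solution below into JSON with exactly two keys, "
        '"reasoning" (string) and "answer" (string). Output only JSON.\n\n'
        f"Solution:\n{reasoning}"
    )
    return json.loads(formatted)


def fake_generate(prompt: str) -> str:
    """Stub model used only to make this sketch runnable offline."""
    if prompt.startswith("Convert"):
        return json.dumps({"reasoning": "2 + 2 = 4", "answer": "4"})
    return "2 + 2 = 4, so the answer is 4."


result = delayed_structure("What is 2 + 2?", fake_generate)
print(result["answer"])  # prints 4
```

Splitting the work this way means the formatting pass only has to transcribe an existing solution, which is the intuition behind the capacity competition account. The second call does add latency and tokens, so the sketch fits the near-capacity case the abstract describes, not models with ample headroom.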