When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following

2026-06-08 · Computation and Language

AI summary

The authors studied how a built-in "thinking" mode, switched on or off in the same model weights, affects instruction following in large reasoning models (LRMs), a model class better known for gains on math and coding. Aggregate pass rates change little, but thinking shifts which prompts a model passes or fails: constraints that require planning (global counting, structure, coordination) improve, while constraints that require exact local form worsen. Changes in final-answer length explain much of the precision drop, but not all of it. The relevance of the thinking trace to the constraint predicts compliance differently for each constraint class, and activation patching restores precision-related failures more often than planning-related ones. Thinking therefore helps some aspects of instruction following and hurts others.

large reasoning models, instruction following, thinking mode, planning vs precision, cross-encoder relevance, activation patching, pass-rate, model size, error pattern
Authors
Sai Adith Senthil Kumar
Abstract
Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. We study IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls; four Hunyuan models provide directional cross-family support. Aggregate pass-rate changes are small (-0.55 to -3.52 pp), yet 10-20% of prompts switch between pass and fail across modes. This suggests that thinking changes the pattern of errors (some prompts improve while others worsen) rather than uniformly degrading performance. Under a post-hoc Qwen3-derived grouping, constraint types separate into Planning (global counting, structure, coordination), which improves at the class level under thinking, and Precision (exact local form), which consistently worsens. The class-level Planning/Precision sign pattern holds directionally for all four Hunyuan models despite Hunyuan's opposite aggregate direction. Thinking also changes final-answer length; matched-length analyses substantially reduce the Precision drop, but a residual penalty remains. Analyzing thinking traces with a cross-encoder (CE) relevance metric reveals three patterns: Neutral shows a positive relevance-compliance link (r ≈ 0.15); Planning shows near-zero predictive correlation (r ≈ 0.02) despite measurable trace engagement, consistent with an execution gap between CE-measured trace relevance and final-answer compliance; Precision shows a small negative correlation (r ≈ -0.05), with failing instances having higher mean relevance than passing ones. Activation patching across four model sizes (1.7B-14B) shows that Precision flip instances are more often restored than Planning flip instances (32-58% vs. 14-40% mean layer-restoration), with the largest gap at 14B (about 30 pp).
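The contrast between a small aggregate pass-rate change and a much larger share of prompts that switch outcome can be illustrated with a short sketch. The function name `flip_stats`, the `FlipStats` container, and the toy pass/fail vectors below are hypothetical and are not taken from the paper; the numbers are chosen only to fall inside the ranges quoted in the abstract.

```python
from dataclasses import dataclass


@dataclass
class FlipStats:
    delta_pp: float   # aggregate pass-rate change (ON minus OFF), percentage points
    gained: int       # prompts that fail with Thinking OFF but pass with Thinking ON
    lost: int         # prompts that pass with Thinking OFF but fail with Thinking ON
    flip_rate: float  # fraction of prompts whose outcome differs between modes


def flip_stats(pass_off: list[bool], pass_on: list[bool]) -> FlipStats:
    """Compare per-prompt pass/fail outcomes for the same prompts in two modes."""
    if len(pass_off) != len(pass_on) or not pass_off:
        raise ValueError("need two non-empty, equally long outcome lists")
    n = len(pass_off)
    gained = sum(1 for off, on in zip(pass_off, pass_on) if not off and on)
    lost = sum(1 for off, on in zip(pass_off, pass_on) if off and not on)
    delta_pp = 100.0 * (sum(pass_on) - sum(pass_off)) / n
    return FlipStats(delta_pp, gained, lost, (gained + lost) / n)


# Hypothetical toy data for 100 prompts (not results from the paper):
# 80 pass with Thinking OFF; with Thinking ON, 7 of those fail and 5 new ones pass.
pass_off = [True] * 80 + [False] * 20
pass_on = [True] * 73 + [False] * 7 + [True] * 5 + [False] * 15

stats = flip_stats(pass_off, pass_on)
print(f"aggregate change: {stats.delta_pp:+.2f} pp")
print(f"gained: {stats.gained}, lost: {stats.lost}, flip rate: {stats.flip_rate:.0%}")
```

In this toy case the aggregate pass rate moves by only 2 pp, while 12% of prompts change outcome, because gains and losses largely cancel in the aggregate.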