When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation

2026-05-25 | Computation and Language

AI summary

The authors studied how meaning-bearing changes to questions, such as synonyms or paraphrases, affect the final answers of chain-of-thought and ReAct agents compared with presentation changes, such as formatting or reordering. Meaning-bearing changes altered answers more often than presentation changes of comparable severity, a pattern that held across ten models and three datasets (GSM8K, MATH, and HotpotQA). The gap survived several audits, but some stress tests failed: for example, significance disappeared under a stricter cluster bootstrap, and a second LLM judge agreed only moderately. Trace-level analysis on a held-out model suggests that semantic perturbations often leave the first action unchanged but cause intermediate reasoning to diverge at later steps, a pattern the authors call 'stealth-divergence.' They released their data and code to help others verify and build on the work.

chain-of-thought reasoning, ReAct agents, large language models, semantic perturbations, presentation perturbations, GSM8K, HotpotQA, bootstrap significance, trajectory analysis, stealth-divergence
Authors
Liyun Zhang, Jiayi Guo
Abstract
We document an empirical phenomenon in chain-of-thought and ReAct agents driven by ten large language models from seven architecture families: meaning-bearing perturbations (e.g., paraphrase, synonym) alter final answers more often than presentation perturbations (e.g., formatting, reordering) of comparable severity. Across 68 cells spanning GSM8K, MATH, and HotpotQA (1,530 originals and $\sim$11,150 variants), the inconsistency gap averages +19.69 pp after severity matching (paired $t=9.58$, $p<0.0001$), with 64/68 cells positive. The gap survives four severity-proxy audits and remains significant when excluding Qwen models (+11.10 pp, $p<0.0001$). Several stress tests fail, and we report them as such: cluster-bootstrap significance disappears under stricter assumptions, tractability contrasts do not replicate, cross-architecture generator swaps break per-cell rankings, and a second LLM judge yields only moderate agreement ($\kappa=0.50$). We then validate the headline effect on a fully held-out 11th model (qwen2.5-14B-Instruct; 1,800 trajectories) and re-test a pre-registered capability$\times$tractability partition, observing a small but positive held-out effect (3/4 cells positive; pooled Welch $t=3.81$, $p=9.6\times10^{-4}$). Using held-out trajectories, we probe four trace-level mechanism signals. Two prior mechanism claims fail to replicate and are explicitly retracted. Two new probes instead support a \emph{stealth-divergence} picture: semantic perturbations often preserve the first action but induce divergence in intermediate reasoning from later steps onward, accompanied by slightly deeper trajectories. We position this as a measurement contribution with held-out replication and a partial trace-level account of how semantic perturbations propagate through agent reasoning. Code, perturbation corpus, raw trajectories, and analysis scripts are released anonymously for review.
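
To make the headline statistic concrete, the following is a minimal sketch of how a per-cell inconsistency gap and its paired $t$ statistic could be computed. It is an illustration under stated assumptions, not the authors' released analysis code: the function names `inconsistency_rate` and `paired_t` are hypothetical, the per-cell rates are made-up toy values, and severity matching is assumed to have been applied before the rates were computed.

```python
import math
import statistics


def inconsistency_rate(original_answer, variant_answers):
    """Fraction of perturbed variants whose final answer differs from the
    answer given on the original question (hypothetical helper)."""
    if not variant_answers:
        raise ValueError("need at least one variant answer")
    changed = sum(1 for answer in variant_answers if answer != original_answer)
    return changed / len(variant_answers)


def paired_t(semantic_rates, presentation_rates):
    """Paired t statistic over cells.

    Each cell contributes one difference (semantic rate minus presentation
    rate), expressed in percentage points. Returns (mean gap in pp, t).
    """
    if len(semantic_rates) != len(presentation_rates):
        raise ValueError("both lists must have one entry per cell")
    diffs = [100.0 * (s - p) for s, p in zip(semantic_rates, presentation_rates)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two cells")
    mean_gap = statistics.mean(diffs)
    # Standard error of the mean difference, using the sample standard deviation.
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return mean_gap, mean_gap / standard_error


# Toy values for four hypothetical cells (NOT data from the paper).
semantic = [0.40, 0.35, 0.50, 0.30]
presentation = [0.20, 0.25, 0.30, 0.10]

mean_gap, t_stat = paired_t(semantic, presentation)
print(f"mean gap = {mean_gap:.2f} pp, paired t = {t_stat:.2f}")
```

With 68 cells the same computation has 67 degrees of freedom. A cell counts as "positive" when its difference is above zero, which is how a tally such as 64/68 would be read off the list of per-cell differences.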