Does the Same Token Mean the Same State? MoE Routing as Signal for Reasoning Control

2026-06-22 · Computation and Language

AI summary

The authors investigate how sparse Mixture-of-Experts (MoE) language models route tokens during generation. They find that the same token ID can correspond to different internal expert routes depending on context and reasoning mode. Using these routing patterns at key points in the output, they develop RAD, a new method to select the best answer without needing to compare output strings directly. RAD performs similarly to traditional majority voting on well-defined tasks and works better when exact string matching isn't possible, like in code generation or patch selection.

Sparse Mixture-of-Experts, Token routing, Router state, Multi-rollout selection, Weighted-Jaccard similarity, Majority voting, Code generation, Patch selection, Test-time control, Answer-string-free decoding
Authors
Kang Chen, Minshen Yu, Junjie Nian, Yaoning Wang, Yixin Cao, Yugang Jiang
Abstract
In sparse Mixture-of-Experts language models, does the same token id imply the same router state and the same experts producing it? Holding the emitted token id fixed at repeated anchors, we find it does not: the experts that produce it still separate task context, trajectory history, and reasoning-effort mode. This residual structure supports test-time control: near boundary anchors (the final-response transition) and delimiter anchors (which open the answer, e.g. \boxed{ or code fences), routing neighborhoods already align with final-answer basins at a marker-only readout, and the alignment is strongest when the routing is read at the answer opening. We operationalize this as RAD (Routing Agreement Decoding), an answer-string-free multi-rollout selector: it locates a fixed anchor, represents each rollout by its anchor-window MoE routing states, and returns the densest Weighted-Jaccard K-NN route-basin center, without parsing, normalizing, executing, or voting over answer strings. Across 10 sparse-MoE configurations (gpt-oss, Qwen3-MoE) and 6 datasets spanning math, GPQA, and code, RAD is on par with Majority where string voting is well-posed, with small positive paired deltas (RAD 73.9 / RAD+DC 74.2 vs. Majority 73.6). Like majority voting, RAD is not a verifier: a dense wrong basin can still win. Its value is the interface: the same selector gives direct pass@1 on code, where exact-string voting is ill-defined, and the same routing-density principle, re-anchored to the agentic boundary, improves best-of-16 patch selection on SWE-bench Verified over random selection, where patches have no answer string to vote on.
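The selection step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: each rollout is taken to be already summarized as a non-negative vector of expert routing weights aggregated over its anchor window, the Weighted-Jaccard similarity is taken in its common form (sum of element-wise minima over sum of element-wise maxima), and "densest" is read as the highest mean similarity to the K nearest neighbors. The names weighted_jaccard and select_densest_rollout, and the toy vectors, are hypothetical.

```python
from typing import List, Sequence


def weighted_jaccard(a: Sequence[float], b: Sequence[float]) -> float:
    """Weighted Jaccard similarity of two non-negative vectors:
    sum of element-wise minima divided by sum of element-wise maxima."""
    if len(a) != len(b):
        raise ValueError("routing vectors must have the same length")
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    # Two all-zero vectors are treated as identical (assumption).
    return num / den if den > 0 else 1.0


def select_densest_rollout(routes: List[Sequence[float]], k: int = 2) -> int:
    """Return the index of the rollout whose k nearest neighbors
    (by weighted Jaccard) are most similar to it on average."""
    n = len(routes)
    if n == 0:
        raise ValueError("need at least one rollout")
    if n == 1:
        return 0
    k = max(1, min(k, n - 1))
    best_idx, best_density = 0, -1.0
    for i in range(n):
        sims = sorted(
            (weighted_jaccard(routes[i], routes[j]) for j in range(n) if j != i),
            reverse=True,
        )
        density = sum(sims[:k]) / k  # mean similarity to the k closest rollouts
        if density > best_density:
            best_idx, best_density = i, density
    return best_idx


# Toy example: three rollouts share a routing pattern, one is an outlier.
toy_routes = [
    [0.90, 0.10, 0.0, 0.0],
    [0.80, 0.20, 0.0, 0.0],
    [0.85, 0.15, 0.0, 0.0],
    [0.00, 0.00, 0.5, 0.5],
]
chosen = select_densest_rollout(toy_routes, k=2)
```

In this toy case the selector picks the rollout in the middle of the three-member cluster and never inspects an answer string. As the abstract notes, such a selector is not a verifier: if the largest cluster corresponds to a wrong answer, its center is still returned.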