Attention is Just Another Name for Coupling?: A Fast-Slow ODE Perspective on Hierarchical Pretraining

2026-06-15 | Artificial Intelligence

Artificial Intelligence, Machine Learning
AI summary

The authors ask whether causal self-attention, in which each token is updated from the tokens before it, benefits from a second, slower system that attends over a pooled, downsampled version of the sequence. They model this fast-slow interaction with singularly perturbed ordinary differential equations and build a network that combines fast causal attention over all tokens with slow attention over pooled tokens, joined by a gate that is initialised to zero (closed). Under a linear-generator assumption on the fast dynamics, they prove that the equilibrium manifold coincides with the stationary distribution of the master equation. Experimentally, at the tested scale of 500k tokens, the slow system neither helps nor harms performance, so the main contribution is the mathematical mapping between attention and fast-slow differential equations.

causal self-attention, fast-slow dynamics, singular perturbation, ordinary differential equations, block-mean pooling, equilibrium manifold, master equation, variational approximation, neural network architecture, attention mechanism
Authors
Zhengyuan Gao
Abstract
Causal self-attention is a coupling mechanism: each token's hidden state is updated by a learned mixture of preceding tokens at the same timescale. This paper asks whether a second, temporally slower coupling (a slow sub-system operating on a temporally downsampled view of the sequence and fed back into the fast path through a zero-initialised gate) complements it. The question is framed in the language of singularly perturbed ordinary differential equations (ODEs), where the fast variable $x$ evolves at the token rate, the slow variable $y$ evolves at one update per $P$ tokens, and the timescale ratio $\varepsilon = 1/P$ is enforced structurally by causal block-mean pooling. The paper instantiates the fast-slow ODE formalism as a concrete neural network: a fast path of standard causal attention over $T$ tokens, a slow path of full attention over $T/P$ pooled tokens ($P^2\times$ cheaper per layer), and a zero-initialised additive gate. In addition, under a linear-generator assumption on the fast dynamics, we prove that the equilibrium manifold $x = \varphi(y)$ is exactly the master-equation (ME) stationary distribution $p_{\mathrm{st}}(y)$; in that regime a learned MLP $\varphi_\theta(y)$ is a variational approximation of it (the trained block is not a generator, so this identity is the structured limit, not a claim about the network as trained). Empirically, at $500$k tokens the coupling is neutral: the gate stays closed and the coupled and frozen ablations are within run-to-run noise, at a wall-clock cost comparable to a dense baseline. The contribution is the precise, gap-marked mapping itself, not a performance gain.
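To make the three components named in the abstract concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes single-head attention with identity projections, a scalar gate, and a residual sum; the function names (`attention`, `block_mean_pool`, `coupled_layer`) are hypothetical. The abstract does not say how the slow feedback is aligned with token positions, so the sketch simply broadcasts each pooled state back to its own block and makes no causality claim for that path.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(h, causal):
    # Single-head self-attention with identity Q/K/V projections (assumption).
    n, d = h.shape
    scores = h @ h.T / np.sqrt(d)
    if causal:
        # Mask strictly-upper-triangular entries so position t sees only <= t.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ h

def block_mean_pool(x, P):
    # Average each run of P consecutive tokens: (T, d) -> (T/P, d).
    T, d = x.shape
    assert T % P == 0, "sketch assumes T is a multiple of P"
    return x.reshape(T // P, P, d).mean(axis=1)

def coupled_layer(x, P, gate=0.0):
    # Fast path: causal attention over all T tokens.
    fast = attention(x, causal=True)
    # Slow path: full attention over the T/P pooled tokens.
    slow = attention(block_mean_pool(x, P), causal=False)
    # Broadcast each slow state back to the P tokens of its block (assumption).
    feedback = np.repeat(slow, P, axis=0)
    # Additive gate; gate=0.0 mimics the zero-initialised (closed) state.
    return x + fast + gate * feedback

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))           # T = 16 tokens, d = 8 features
closed = coupled_layer(x, P=4)         # gate closed: slow path has no effect
opened = coupled_layer(x, P=4, gate=1.0)
```

With the gate at zero, `closed` equals the fast-only output `x + attention(x, causal=True)`, which is the sense in which the coupling starts switched off. The slow path's score matrix has $(T/P)^2$ entries against $T^2$ for the fast path, which is where the $P^2\times$ saving comes from.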