Adaptive inference and function vectors in deep transformers

2026-06-15 • Machine Learning

Machine Learning, Artificial Intelligence

AI summary

The authors model a deep transformer as a system of many interacting parts that jointly infer hidden information, step by step. They show that successive layers can refine the model's estimate of a hidden (latent) context variable at increasingly fine scales. Their theory links the model's depth to the complexity of the hidden structure it can handle, and they test its predictions on constrained linear attention transformers. They conclude that depth and feedforward blocks let transformers implement a richer class of in-context learning algorithms than previously described.

Transformers, Mean-field theory, In-context learning, Latent variables, Hierarchical structure, Linear attention, Feedforward networks, Deep learning, Distributed inference, Neural network depth

Authors

Ravin Raj, Gautam Reddy

Abstract

Transformers are widely used as a general-purpose substrate for learning complex correlations among a large collection of coupled variables, but their internal mechanisms have remained mysterious. We introduce a theory of a deep transformer as a mean-field interacting system that implements distributed inference, subject to constraints on communication, locality and depth. We show that such a system can exploit internal state representations ('function vectors') to infer a latent context variable at increasingly finer scales over its layers. In an in-context regression task, the theory predicts a non-trivial relationship between non-Gaussian, hierarchical structure in the latent context variable and transformer depth. These predictions are tested using constrained linear attention transformers, and the results demonstrate adaptive inference in deep architectures. Feedforward blocks and depth enable transformers to implement a much richer class of in-context learning algorithms than previously described.
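For readers unfamiliar with the setting, the following is a minimal sketch of in-context regression with a single linear (softmax-free) attention readout. It is not the authors' constrained architecture and does not capture the role of depth or feedforward blocks described in the abstract. It only illustrates a standard single-layer baseline from the in-context learning literature: with keys `X`, values `y`, and a query `x_query`, the readout coincides with the prediction after one gradient-descent step from zero weights. All names (`linear_attention_predict`, `one_step_gd_predict`, `eta`, `w_true`) and the Gaussian data model are assumptions made for illustration.

```python
import numpy as np

def linear_attention_predict(X, y, x_query, eta=1.0):
    """Linear attention readout: keys X (n, d), values y (n,), query x_query (d,)."""
    scores = X @ x_query              # unnormalized key-query dot products, shape (n,)
    return eta * np.mean(scores * y)  # average of the values weighted by the scores

def one_step_gd_predict(X, y, x_query, eta=1.0):
    """Prediction after one gradient step from w = 0 on 0.5 * mean squared error."""
    w = eta * (X.T @ y) / len(y)      # w_1 = -eta * gradient at w = 0
    return w @ x_query

# Hypothetical toy task: the latent context variable is a weight vector w_true.
rng = np.random.default_rng(0)
d, n = 4, 2000
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))           # context inputs
y = X @ w_true                        # noiseless context targets
x_query = rng.normal(size=d)

pred_attn = linear_attention_predict(X, y, x_query)
pred_gd = one_step_gd_predict(X, y, x_query)
```

Here `pred_attn` and `pred_gd` agree up to floating-point error, because both compute `eta * x_query @ (X.T @ y) / n`. This single-layer estimator treats every context the same way; the abstract's claim is that deeper models with feedforward blocks can go beyond it by adapting to non-Gaussian, hierarchical structure in the latent variable.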
