Transformer Architectures as Complete Bayes Processes: A Formal Proof in the Measure-Theoretic Kernel Framework

2026-06-29 • Machine Learning

Machine Learning, Artificial Intelligence

AI summary

The author presents a formal proof that a transformer implements exact Bayesian posterior inference whenever its internal update mechanism satisfies a Bayes joint-distribution condition. The proof is built level by level: from a core Bayesian transformer, through semantic transformers with explicit update kernels, to full transformer blocks with QKV, attention, residual, and MLP components, and finally to multilayer stacks. The paper also proves that softmax attention induces a valid probability distribution over keys. In short, when the condition holds, the forward computation of a transformer is formally equivalent to an exact Bayesian posterior update.

Transformer architecture, Bayesian inference, Markov kernel, Radon-Nikodym derivative, Softmax attention, Posterior distribution, Measure theory, QKV mechanism, Residual connections, Multilayer perceptron (MLP)

Authors

Haobo Yang

Abstract

We present a complete formal proof that transformer architectures, when their internal update mechanisms satisfy a Bayes joint-distribution condition, implement exact Bayesian posterior inference. Working within the measure-theoretic kernel framework, we define a hierarchy of abstractions -- from the core Bayesian transformer, through semantic transformers with explicit update kernels, to full transformer blocks with QKV/attention/residual/MLP pipelines, and finally multilayer stacks -- and prove at each level that the Bayes joint semantics implies the update kernel equals the posterior almost everywhere. For the block-level architecture, we derive the explicit Bayes formula through Radon-Nikodym differentiation and prove its normalization. We additionally prove that the softmax attention mechanism induces a valid probability distribution over keys, establishing the bridge between the abstract kernel framework and concrete attention implementations. The framework makes no architectural assumptions beyond the Markov kernel structure and exposes explicit conditions under which a transformer block is provably Bayesian. In essence, when this joint distribution condition is satisfied, the forward computation of a transformer is formally equivalent to a rigorous Bayesian posterior update.
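As a finite, discrete illustration of two claims in the abstract, the sketch below checks that softmax attention weights form a probability distribution over keys, and that a normalized Bayes update over a finite hypothesis set coincides with a softmax over log-prior plus log-likelihood scores. This is not the paper's measure-theoretic construction: the function names `softmax`, `attention_weights`, and `discrete_posterior`, the scaled dot-product scoring, and the toy numbers are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys.

    Assumption: standard 1/sqrt(d) scaling; the paper's kernel-level
    definition may be more general.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    return softmax(scores)


def discrete_posterior(prior, likelihood):
    """Bayes update on a finite hypothesis set: posterior is prior * likelihood, normalized."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)
    if evidence == 0:
        raise ValueError("observation has zero probability under the prior")
    return [j / evidence for j in joint]


# Toy data (hypothetical): one query attending over three keys.
weights = attention_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
assert all(w >= 0 for w in weights)
assert abs(sum(weights) - 1.0) < 1e-12  # a valid distribution over keys

# Toy Bayes update (hypothetical numbers): two hypotheses, one observation.
prior = [0.5, 0.5]
likelihood = [0.9, 0.1]
posterior = discrete_posterior(prior, likelihood)

# The same posterior, written as a softmax over log-prior + log-likelihood.
log_scores = [math.log(p) + math.log(l) for p, l in zip(prior, likelihood)]
posterior_via_softmax = softmax(log_scores)
assert all(abs(a - b) < 1e-12 for a, b in zip(posterior, posterior_via_softmax))
```

In the discrete case the normalizer inside `softmax` plays the role of the evidence term in Bayes' rule, which is the finite analogue of the normalization the abstract says is proved for the block-level Bayes formula.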
