Key-Value Means
2026-05-11 • Machine Learning
Machine Learning • Artificial Intelligence • Computation and Language
AI summary
The authors introduce Key-Value Means (KVM), a new type of attention mechanism for transformers that can handle either a fixed or growing amount of memory. Using KVM, their model can efficiently process long sequences with lower computational cost and memory use than traditional transformers. KVM works with standard software tools, supports parallel training in chunks, and can be combined with recurrent-neural-network layers in hybrid models to enhance performance on long contexts. The authors release their code and trained models openly for others to use and explore.
transformer • attention mechanism • recurrent neural network (RNN) • long context modeling • memory efficiency • parallel training • chunking • fixed-size state • sublinear complexity • pretraining
Authors
Daniel Goldstein, Eugene Cheah
Abstract
We present Key-Value Means ("KVM"), a novel block-recurrence for attention that can accommodate either fixed-size or growing state. Equipping a strong transformer baseline with fixed-size KVM attention layers yields an effective $O(N)$ chunked RNN, while adding only a negligible number of new parameters. We train a transformer with a growable KVM cache and show it performs competitively on long-context tests with only subquadratic prefill time and sublinear state growth. KVM is implementable with standard operations and without custom kernels, and supports chunk-wise parallelizable training and prefill. It provides many of the benefits of both traditional transformers (expandable context memory, chunk-wise parallelizable training and prefill) and linear RNNs (LRNNs) in a single unified package. It can be used on every layer, saving KV-cache memory and allowing a continuous range of prefill time complexities between $O(N)$ and $O(N^2)$. It can also be implemented in a hybrid solution in tandem with LRNN layers in place of traditional attention, supplementing the LRNN with context memory that grows sublinearly in sequence length and with improved long-context decoding. We release our code at https://github.com/recursal/KVM-paper and trained models at https://huggingface.co/collections/recursal/key-value-means under the Apache 2.0 license.
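To make the sublinear-state idea concrete, here is a minimal toy sketch in PyTorch. It assumes, purely for illustration, that a growable KVM-style cache can be pictured as mean-pooling each chunk's keys and values into a single memory slot and attending over those means; the function names, the exact update rule, and the chunking scheme are my assumptions, not the paper's definition of KVM.

```python
# Hypothetical sketch (not the paper's algorithm): compress the KV cache by
# replacing each fixed-size chunk's keys and values with their means, then run
# ordinary attention over the compressed memory. The memory holds N / chunk_size
# entries rather than N, so state growth is sublinear in sequence length.
import torch

def chunk_mean_kv(k, v, chunk_size):
    """Mean-pool keys/values over fixed-size chunks: (N, d) -> (N // chunk_size, d)."""
    n, d = k.shape
    n_chunks = n // chunk_size  # assume N divisible by chunk_size for simplicity
    k_means = k[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d).mean(dim=1)
    v_means = v[: n_chunks * chunk_size].reshape(n_chunks, chunk_size, d).mean(dim=1)
    return k_means, v_means

def attend(q, k, v):
    """Plain scaled dot-product attention over whatever memory is passed in."""
    scores = q @ k.T / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 1024 tokens are summarized into 64 memory slots before attention.
N, d, chunk_size = 1024, 64, 16
q, k, v = (torch.randn(N, d) for _ in range(3))
k_mem, v_mem = chunk_mean_kv(k, v, chunk_size)   # 64 slots instead of 1024
out = attend(q, k_mem, v_mem)
print(out.shape)  # torch.Size([1024, 64])
```

Because each chunk is pooled independently, this kind of compression can be computed chunk by chunk, which is consistent with the chunk-wise parallel training and prefill the abstract describes; how KVM actually forms and updates its state is specified in the paper and code linked above.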