Continuous Latent Contexts Enable Efficient Online Learning in Transformers
2026-05-11 • Machine Learning
Machine Learning • Artificial Intelligence
AI summary
The authors explore how transformers, the architecture behind large language models, can learn and adapt during ongoing tasks rather than only making static predictions. They show that continuous latent context tokens, a small set of internal memory vectors, let transformers effectively carry out online learning algorithms such as weighted majority and Q-learning. A smaller model trained this way outperformed larger language models on long prediction tasks. This work suggests that continuous latent contexts help transformers keep track of what they have learned over time and adapt accordingly.
transformers, large language models, in-context learning, online learning, latent context tokens, weighted majority algorithm, Q-learning, persistent state, GPT-2, multi-curriculum training
Authors
Emile Anand, Abdullah Ateyeh, Xinyuan Cao, Max Dabagia
Abstract
Large language models (LLMs) exhibit a strong capacity for in-context learning: Given labeled examples, they can generate good predictions without parameter updates. However, many interactive settings go beyond static prediction to online decision-making, in which effective behavior demands adaptation over long multi-turn horizons in response to feedback, and efficient algorithms in these domains must use compact representations of what they have learned. Recently, continuous transformer architectures with latent chain of thought have shown promise for offline iterative tasks such as directed graph-reachability. Motivated by this, we study whether continuous latent context tokens equip transformers to more effectively realize online learning. We give explicit constructions of constant-depth transformers that implement two foundational online decision-making procedures -- the weighted majority algorithm and $Q$-learning -- by storing their algorithmic state as linear combinations of feature embeddings, using a small number of latent context tokens. We further train a small GPT-2-style transformer with latent contexts using a multi-curriculum objective that does not directly supervise the latent states. On long synthetic online prediction sequences, this model outperforms larger and more complex LLMs, including Qwen-3-14B and DeepSeek-V3. Our results suggest that continuous latent contexts provide a simple and effective persistent state for transformers to implement online learning algorithms.
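As a rough illustration of the kind of online procedure the constructions cover, the sketch below runs the classical weighted majority algorithm and then shows how its weight vector can be read off a single latent vector built as a linear combination of per-expert embeddings. This is a minimal sketch under our own assumptions (variable names, the orthonormal-embedding trick, the update rule), not the paper's actual transformer construction.

```python
# Minimal sketch: weighted majority over a stream of expert predictions,
# plus a toy demonstration of storing the algorithm's state as a single
# vector that is a linear combination of expert feature embeddings.
# All names and choices here are illustrative, not the authors' construction.

import numpy as np

def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Run the weighted majority algorithm on binary predictions.

    expert_predictions: (T, n) array of 0/1 predictions from n experts.
    outcomes:           (T,)   array of true 0/1 labels.
    Returns the learner's predictions and the final expert weights.
    """
    T, n = expert_predictions.shape
    weights = np.ones(n)                    # algorithmic state: one weight per expert
    learner_preds = np.zeros(T, dtype=int)

    for t in range(T):
        votes = expert_predictions[t]
        # Predict with the weighted majority vote.
        learner_preds[t] = int(weights @ votes >= weights.sum() / 2)
        # Multiplicatively down-weight experts that were wrong.
        wrong = votes != outcomes[t]
        weights[wrong] *= beta

    return learner_preds, weights


# The same state can be kept as one latent vector
#   z_t = sum_i w_i(t) * phi(i),
# where phi(i) is a fixed embedding of expert i. With orthonormal
# embeddings, each weight is recovered as w_i(t) = <z_t, phi(i)>,
# which is the flavor of "state as a linear combination of feature
# embeddings" described in the abstract (illustrative only).
n_experts, d = 4, 4
phi = np.eye(d)[:n_experts]                 # toy orthonormal embeddings
weights = np.array([1.0, 0.5, 0.25, 1.0])
z = weights @ phi                           # compact latent state vector
recovered = phi @ z                         # read the weights back out
assert np.allclose(recovered, weights)
```

In this toy picture, the latent vector plays the role the abstract assigns to continuous latent context tokens: a compact persistent state that is updated each round and from which the algorithm's decisions can be reconstructed.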