Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations

2026-05-27 | Machine Learning

AI summary

The authors study how to improve large language models by combining two approaches to token mixing: softmax attention and linear recurrent models. They propose Oryx, a hybrid model that can switch between these mixers within a single sequence while tying at least 90% of its parameters across them. In experiments at scales up to 1.4B, Oryx matches or beats its single-mixer baselines on language modeling, and on retrieval tasks it performs comparably to a Transformer baseline while processing fewer than 10% of tokens in attention mode. The results suggest that mixing attention and recurrent approaches along the sequence is a promising direction.

softmax attention, linear recurrent models, state space models, token sequence, parameter sharing, language modeling, retrieval tasks, hybrid architecture, sequence-axis hybridization
Authors
Kevin Y. Li, Asher Trockman, Ananda Theertha Suresh, Ziteng Sun
Abstract
Softmax attention is the cornerstone of modern large language models, but its memory scales linearly and compute quadratically with sequence length. Linear recurrent models, such as linear attention and state space models, have become widely studied as alternatives to attention due to their linear compute and constant memory. While these sub-quadratic token mixing methods, or mixers, achieve promising efficiency gains and competitive results on a wide range of benchmarks, current linear recurrent models still lag behind on tasks that require long-context retrieval or in-context learning. A growing body of work studies hybrid architectures that attempt to mitigate these trade-offs by statically interleaving or merging attention and recurrent blocks. In this work, we explore a new axis of developing hybrid models: across the token sequence. We propose Oryx, a hybrid model that can, throughout a sequence, flexibly switch between different mixers, for example quadratic attention for rich context utilization and linear recurrences for efficient generation. Oryx ties at least 90% of its parameters across mixers, enabling attention and recurrent modes to operate over shared internal representations. We validate our design with Mamba-2 and Gated DeltaNet variants, up to 1.4B models. Under fixed token budgets and a mixed-training strategy, Oryx achieves comparable or better performance than its single-mixer baselines. At the 1.4B scale, all instances of Oryx outperform their respective baselines by at least 0.7 percentage points on averaged language modeling tasks. On retrieval tasks, Oryx achieves performance comparable to the Transformer baseline even when processing only a tiny fraction (<10%) of the tokens in attention mode. These results suggest that attention and linear recurrent models can share internal representations, and motivate sequence-axis hybridization as a promising direction.
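The abstract does not give Oryx's layer definitions, so the following is only a toy sketch of the general idea of sequence-axis hybridization, not the authors' implementation. It assumes a single mixer whose query, key, and value projections are shared by two modes: causal softmax attention (memory grows with each token) and an unnormalized linear-attention recurrence (fixed-size state). The class name `SharedMixer`, the per-token `modes` schedule, and the choice that each token writes only to the memory of the mode it runs in are all hypothetical; how Oryx actually passes context between modes is not stated here.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class SharedMixer:
    """Toy token mixer whose q/k/v projections are tied across two modes."""

    def __init__(self, Wq, Wk, Wv):
        # Assumption: the same projection weights serve both modes.
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        # Attention memory: one (key, value) pair per attention-mode token.
        self.kv_cache = []
        # Recurrent memory: fixed d_k x d_v matrix, independent of length.
        self.state = [[0.0] * len(Wv) for _ in range(len(Wk))]

    def step(self, x, mode):
        q, k, v = matvec(self.Wq, x), matvec(self.Wk, x), matvec(self.Wv, x)
        if mode == "attention":
            # Causal softmax attention over everything cached so far.
            self.kv_cache.append((k, v))
            scores = [dot(q, kj) / math.sqrt(len(q)) for kj, _ in self.kv_cache]
            top = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            return [
                sum(w * vj[i] for w, (_, vj) in zip(weights, self.kv_cache)) / total
                for i in range(len(v))
            ]
        if mode == "recurrent":
            # Unnormalized linear attention: S += k v^T, then y = S^T q.
            for i, ki in enumerate(k):
                for j, vj in enumerate(v):
                    self.state[i][j] += ki * vj
            return [
                sum(q[i] * self.state[i][j] for i in range(len(q)))
                for j in range(len(v))
            ]
        raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical usage: identity projections and a hand-written mode schedule.
identity = [[1.0, 0.0], [0.0, 1.0]]
mixer = SharedMixer(identity, identity, identity)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
modes = ["attention", "recurrent", "recurrent", "attention"]
outputs = [mixer.step(x, m) for x, m in zip(tokens, modes)]
```

In this sketch the key-value cache holds only the two attention-mode tokens, while the recurrent state stays a 2 x 2 matrix however many recurrent-mode tokens are processed, which mirrors the linear-memory versus constant-memory contrast drawn in the abstract.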