Forget Attention: Importance-Aware Attention Is All You Need

2026-06-01 • Artificial Intelligence

Artificial Intelligence, Computation and Language, Machine Learning

AI summary

The authors address the challenge of combining two types of language models: Transformers, which can attend to every token but have no built-in notion of which tokens matter most, and state space models (SSMs), which track what matters as they read but cannot easily look back. Existing hybrids such as Jamba and Hymba keep the two in separate blocks or heads, so neither informs the other during the attention computation itself. The authors propose SISA, which adds an SSM-derived importance term directly to the attention scores and runs as a single standard SDPA call, with no recurrent state and no custom kernel. At the 152M scale, SISA outperforms both a Transformer and Mamba-3 on LAMBADA and reaches perfect NIAH retrieval 7x sooner than the Transformer; at 369M, Mamba-3 leads on LAMBADA while SISA keeps perfect NIAH.

Transformer, State Space Models, Attention Mechanism, Hybrid Language Models, Softmax Attention, Query/Key Vectors, LAMBADA Benchmark, NIAH Metric, Score-Level Fusion

Authors

Soohyeong Shin, Yeongwook Yang

Abstract

Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids -- Jamba (block level) and Hymba (head level) -- place the two in separate compartments, so neither informs the other during the attention computation itself. We propose SISA (SSM-Informed Softmax Attention), which adds an SSM-derived importance term directly inside the attention score and realizes the full operation as a single SDPA call on augmented query/key vectors -- no recurrent state, no custom kernel. At 152M / 5B tokens, SISA reaches LAMBADA-greedy 17.3% (vs. Transformer 13.9 and Mamba-3 15.5) and attains NIAH 100% from step 1K, 7x faster than Transformer's retrieval convergence; at 369M, Mamba-3 leads LAMBADA while SISA preserves perfect NIAH and stock-SDPA execution. SISA thus defines a third design axis for SSM-attention hybrids -- score-level fusion -- beyond the block-level and head-level paradigms that have dominated the field.
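
The abstract's claim that an importance term can live "inside the attention score" yet still run as one SDPA call on augmented query/key vectors can be illustrated with a small sketch. It is not the authors' implementation: it assumes the importance signal is a single additive scalar per key (the name `importance` and the helpers `sdpa`, `biased_attention`, and `augment` are hypothetical), omits causal masking, and uses plain Python lists instead of a tensor library. Under that assumption, appending `1 / scale` to every query and the key's importance value to every key makes an unmodified attention routine produce the same output as attention with an explicit score bias.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sdpa(queries, keys, values, scale):
    # Plain, unmasked scaled dot-product attention: softmax(scale * q.k) @ V.
    out = []
    for q in queries:
        weights = softmax([scale * dot(q, k) for k in keys])
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

def biased_attention(queries, keys, values, importance, scale):
    # Reference version: add importance[j] to the score of key j before softmax.
    out = []
    for q in queries:
        scores = [scale * dot(q, k) + s for k, s in zip(keys, importance)]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

def augment(queries, keys, importance, scale):
    # Append one extra dimension so that
    #   scale * (q . k + (1 / scale) * s) == scale * (q . k) + s.
    q_aug = [q + [1.0 / scale] for q in queries]
    k_aug = [k + [s] for k, s in zip(keys, importance)]
    return q_aug, k_aug

# Toy data: 2 queries, 3 keys/values, head dimension 2.
queries = [[0.5, -1.0], [1.5, 0.25]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
importance = [0.0, 1.5, -0.5]   # assumed per-key importance scores
scale = 1.0 / math.sqrt(2)      # 1 / sqrt(head_dim), fixed before augmentation

reference = biased_attention(queries, keys, values, importance, scale)
q_aug, k_aug = augment(queries, keys, importance, scale)
fused = sdpa(q_aug, k_aug, values, scale)   # ordinary attention, no bias argument
```

Note that the scale is held at the value for the original head dimension; a library routine that derives its scale from the augmented dimension would need the scale passed explicitly or the extra query entry adjusted to compensate. Whether SISA's importance term is query-independent in this way is not stated in the abstract.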
