MATCH: Modulating Attention via In-Context Retrieval for Long-Context Transformers
2026-06-29 • Computation and Language
Computation and Language, Artificial Intelligence, Machine Learning
AI summary
The authors note that standard attention in large language models is computationally expensive, especially on long texts. To address this, they propose MATCH, a framework that combines efficient sparse attention with a retrieval mechanism that pulls in relevant context when needed. Their experiments show that MATCH helps sparse-attention models recall important long-range details while keeping the efficiency benefits of sparse attention. The approach works on both synthetic and real-world language tasks.
attention mechanism, quadratic complexity, sparse attention, large language models, long-context modeling, in-context retrieval, efficiency, scalability, natural language processing
Authors
Linrui Ma, Chun Hei Lo, Xinyu Wang, Peng Lu, Xihao Yuan, Hanting Chen, Kai Han, Xinghao Chen, Chengjun Zhan, Hanlin Xu, Yichun Yin, Lifeng Shang, Feng Wen, Boxing Chen, Yufei Cui
Abstract
The quadratic computational cost of traditional attention mechanisms poses a major bottleneck to the scalability and practical deployment of large language models (LLMs), particularly in long-context scenarios. To improve efficiency, existing approaches often enforce rigid structural constraints such as local attention windows. However, these strategies typically lead to substantial performance degradation on tasks requiring precise long-range recall. In this work, we propose MATCH, a scalable and efficient framework that augments sparsified attention mechanisms with dynamically integrated in-context information through an efficient retrieval system. Empirical results show that MATCH significantly improves the performance of sparse-attention models on both synthetic and real-world natural-language tasks. These findings highlight the versatility of MATCH as a general approach for enhancing in-context retrieval capabilities while maintaining the efficiency benefits of sparse attention architectures.
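To make the general idea in the abstract concrete, the following is a minimal NumPy sketch of a causal local attention window augmented with a few retrieved out-of-window positions per query. It is not the MATCH implementation: the abstract does not specify the retrieval system, so the dot-product top-k selection, the function names, and the parameters `window` and `top_k` are all assumptions made for illustration. The sketch also scores every query against every key, so it shows only the resulting attention pattern, not the efficiency gain an actual retrieval index would provide.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; entries set to -inf receive zero weight."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def window_plus_retrieval_mask(q, k, window, top_k):
    """Return a boolean mask where mask[i, j] is True if query i may attend to key j.

    Hypothetical helper: combines a causal sliding window with the top_k
    highest-scoring causal positions that fall outside that window.
    """
    n = q.shape[0]
    idx = np.arange(n)
    causal = idx[None, :] <= idx[:, None]
    local = causal & (idx[:, None] - idx[None, :] < window)
    mask = local.copy()
    if top_k > 0:
        # Retrieval candidates: earlier positions outside the local window.
        distant = causal & ~local
        scores = np.where(distant, q @ k.T, -np.inf)
        top = np.argsort(-scores, axis=1)[:, :top_k]
        rows = np.repeat(idx, top.shape[1])
        cols = top.ravel()
        # Drop picks that are not real candidates (rows with too few distant keys).
        valid = distant[rows, cols]
        mask[rows[valid], cols[valid]] = True
    return mask


def sparse_attention(q, k, v, window, top_k):
    """Single-head attention restricted to the window-plus-retrieval mask."""
    d = q.shape[-1]
    mask = window_plus_retrieval_mask(q, k, window, top_k)
    logits = np.where(mask, (q @ k.T) / np.sqrt(d), -np.inf)
    return softmax(logits, axis=-1) @ v, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
    out, mask = sparse_attention(q, k, v, window=4, top_k=2)
    print(out.shape, mask.sum(axis=1))
```

With `top_k=0` the mask reduces to plain causal local-window attention, the kind of rigid structural constraint the abstract associates with degraded long-range recall. Each query attends to at most `window + top_k` keys, so the per-query attention cost stays bounded as the context grows, provided the retrieval step itself is cheaper than full scoring.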