Screening Is Enough
2026-04-01 • Machine Learning
Machine Learning • Artificial Intelligence • Computation and Language
AI summary
The author points out that standard softmax attention only ranks keys against each other rather than checking whether each key is truly relevant to the query. The proposed architecture, Multiscreen, is built around a mechanism called screening, which filters out irrelevant keys using a fixed threshold instead of spreading attention mass over all keys. This removes global competition between keys and yields a model that is more parameter-efficient, remains stable at larger learning rates, and holds up on long input sequences. Experiments show it also speeds up inference substantially while using fewer parameters than a standard Transformer.
softmax attention • query-key relevance • Transformer architecture • screening mechanism • model parameters • learning rate • long-context modeling • perplexity • inference latency
Authors
Ken M. Nakanishi
Abstract
A core limitation of standard softmax attention is that it does not define a notion of absolute query–key relevance: attention weights are obtained by redistributing a fixed unit mass across all keys according to their relative scores. As a result, relevance is defined only relative to competing keys, and irrelevant keys cannot be explicitly rejected. We introduce Multiscreen, a language-model architecture built around a mechanism we call screening, which enables absolute query–key relevance. Instead of redistributing attention across all keys, screening evaluates each key against an explicit threshold, discarding irrelevant keys and aggregating the remaining keys, thereby removing global competition among keys. Across experiments, Multiscreen achieves comparable validation loss with approximately 40% fewer parameters than a Transformer baseline, enables stable optimization at substantially larger learning rates, maintains strong performance in long-context perplexity, shows little to no degradation in retrieval performance even far beyond the training context length, and reduces inference latency by up to 3.2× at 100K context length.
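To make the contrast concrete, here is a minimal single-query NumPy sketch of the idea the abstract describes. The paper's actual screening mechanism is not specified here, so everything below the threshold test is an assumption: the scoring function, the sigmoid per-key gate, the averaging over surviving keys, and the threshold name `tau` are all hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def softmax_attention(q, K, V):
    # Standard attention: a fixed unit mass is redistributed over ALL keys,
    # so relevance is only relative -- no key can be rejected outright.
    s = K @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

def screening_attention(q, K, V, tau=0.0):
    # Hypothetical screening sketch: each key is tested against an absolute
    # threshold tau; keys that fail are discarded, and only survivors are
    # aggregated, so keys no longer compete globally for attention mass.
    s = K @ q / np.sqrt(q.shape[0])
    keep = s > tau                        # absolute relevance test per key
    if not keep.any():
        return np.zeros(V.shape[1])       # every key rejected: empty output
    w = 1.0 / (1.0 + np.exp(-s[keep]))    # assumed per-key gate, no softmax
    return (w @ V[keep]) / keep.sum()     # average over surviving keys only
```

One consequence of the absolute threshold: appending a clearly irrelevant key (score far below `tau`) leaves the screening output unchanged, whereas under softmax every added key still receives nonzero mass and perturbs the result.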