Long-Context Modeling via GSS-Transformer Hybrid Architecture with Learnable Mixing
2026-06-15 • Computation and Language
Computation and Language, Artificial Intelligence
AI summary
The authors address the problem of efficiently modeling long text sequences in language models. They note that Transformers perform well but scale quadratically with sequence length, while State Space Models (SSMs) scale linearly but struggle to retrieve precise information from their compressed state. Their solution, called Parallel Hybrid Architecture (PHA), runs both approaches side by side and lets each do what it does best: SSMs capture overall context, attention handles selective retrieval, and feed-forward networks add complementary processing. Their experiments show that PHA reaches perplexity comparable to or better than existing models, with higher throughput and lower memory use than a pure attention baseline at long contexts.
Transformer, Self-Attention, State Space Models, Parallel Hybrid Architecture, Gated State Spaces, Grouped Query Attention, Feed-Forward Networks, Perplexity, Long-range dependencies, Natural Language Processing
Authors
Kuzey Torlak, Hüseyin Arda Arslan, Anıl Dervişoğlu, Beyza Nur Deniz, Onur Boyar
Abstract
Modeling long-range dependencies remains a central challenge in natural language processing. Transformer architectures achieve strong performance via self-attention but scale quadratically ($O(N^2)$) with sequence length, while State Space Models (SSMs) scale linearly ($O(N)$) but suffer from a selective recall bottleneck: they struggle to retrieve precise information from compressed states. This creates a fundamental tradeoff between efficiency and perplexity. To address it, we propose the Parallel Hybrid Architecture (PHA), which runs Gated State Spaces (GSS), Grouped Query Attention (GQA), and Feed-Forward Networks (FFNs) as independent parallel branches fused by a learnable mixing mechanism. Instead of forcing SSMs to approximate attention or serializing the two paradigms, PHA allows each branch to specialize: GSS captures global context, attention performs selective retrieval, and the FFN provides complementary processing. On WikiText-103, PHA achieves 16.51 PPL at 125M parameters, outperforming Hedgehog (16.70) and H3-125M (23.70). Scaling to 180M parameters yields 16.42 PPL, comparable to the pure attention baseline, while delivering 24% higher throughput and up to 40% lower memory usage at long contexts. On OpenWebText, our 125M model achieves 19.72 PPL, outperforming standard Transformers (20.60) and GSS hybrid baselines (19.80). These results demonstrate that separating sequence modeling paradigms into parallel specialists enables Transformer-level perplexity with substantially improved efficiency for long-context language modeling.
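
The abstract does not specify the form of the learnable mixing mechanism, so the following is only a minimal sketch of the general idea of parallel branches fused by learned weights, not the authors' implementation. It assumes one scalar logit per branch, normalized with a softmax, plus a residual connection. The three branches are deliberately simplified stand-ins: a decaying linear recurrence in place of GSS, single-head causal attention without projections in place of GQA, and an element-wise ReLU in place of the FFN. All function names and the `decay` parameter are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def recurrent_branch(x, decay=0.9):
    """O(N) stand-in for a state space layer (NOT the actual GSS):
    h_t = decay * h_{t-1} + (1 - decay) * x_t."""
    state = [0.0] * len(x[0])
    out = []
    for row in x:
        state = [decay * s + (1.0 - decay) * v for s, v in zip(state, row)]
        out.append(state)
    return out

def attention_branch(x):
    """O(N^2) causal attention with q = k = v = x (assumption: one head,
    no learned projections, no query grouping)."""
    scale = 1.0 / math.sqrt(len(x[0]))
    out = []
    for t, query in enumerate(x):
        # Each position attends only to itself and earlier positions.
        scores = [scale * sum(q * k for q, k in zip(query, x[s]))
                  for s in range(t + 1)]
        probs = softmax(scores)
        out.append([sum(p * x[s][j] for s, p in enumerate(probs))
                    for j in range(len(query))])
    return out

def ffn_branch(x):
    """Position-wise nonlinearity standing in for a feed-forward network."""
    return [[max(0.0, v) for v in row] for row in x]

def parallel_block(x, mix_logits, decay=0.9):
    """Run all three branches on the same input and fuse them with
    softmax-normalized mixing weights (assumed form), plus a residual."""
    branches = [recurrent_branch(x, decay), attention_branch(x), ffn_branch(x)]
    weights = softmax(mix_logits)
    return [
        [row[j] + sum(w * b[t][j] for w, b in zip(weights, branches))
         for j in range(len(row))]
        for t, row in enumerate(x)
    ]
```

Calling `parallel_block(x, [0.0, 0.0, 0.0])` on a list of token vectors weights the three branches equally. In a trained model the logits would be parameters updated by gradient descent, which is one way the relative contribution of each branch could be learned rather than fixed by hand.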