Morphing into Hybrid Attention Models
2026-06-29 • Computation and Language
AI summary
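The authors study how to make Transformer models handle long contexts more efficiently by keeping full attention in only some layers and replacing the rest with linear attention. Rather than relying on heuristics such as fixed placement patterns or per-layer scores, they frame the choice of full-attention layers as a budget-constrained subset selection problem and introduce FlashMorph to solve it. FlashMorph attaches a converted linear-attention branch to every full-attention layer, freezes all model weights, and trains layerwise gates on synthetic long-context retrieval data with a regularizer that favors linear attention. The learned gates are then discretized under a preset full-attention budget, and the resulting hybrid model is refined with logits distillation and long-context finetuning. In their experiments, FlashMorph finds more effective hybrid configurations at a lower selection cost than existing layer selection methods while preserving long-context recall and general benchmark performance.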
Transformer • Attention Mechanism • Full Attention • Linear Attention • Hybrid Models • Layer Selection • Subset Optimization • Model Morphing • Logits Distillation • Long-Context Handling
Authors
Disen Lan, Jianbin Zheng, Yuxi Ren, Xin Xia, Xuanda Wang, Xuefeng Xiao, Xipeng Qiu, Yu Cheng
Abstract
Hybrid attention models improve long-context efficiency by retaining only a subset of full-attention layers and replacing the remaining layers with linear attention. However, the effectiveness of Transformer-to-hybrid conversion critically depends on which layers preserve full attention. Existing hybrid layer selection methods typically rely on heuristic strategies such as fixed placement patterns or layerwise scoring, implicitly treating layer importance as isolated and overlooking the interdependent layer effect under a global hybrid configuration. In this work, we formulate hybrid layer selection as a budget-constrained subset optimization problem. We further propose FlashMorph (Fast LAyer Selection for Hybrid MORPHing), an effective, efficient and scalable layer selection method for Transformer-to-hybrid conversion. FlashMorph first constructs a morphable model by equipping each full-attention layer with a converted linear-attention branch. It then freezes all model weights and jointly optimizes layerwise gates on synthetic long-context retrieval data, with a linearization regularization that encourages the model to rely on linear attention for efficiency. The learned gates are discretized under a preset full-attention budget to instantiate the hybrid architecture, followed by standard logits distillation and long-context finetuning. Extensive experiments show that FlashMorph discovers more effective hybrid configurations, preserves strong long-context recall and general benchmark performance while substantially reducing layer selection cost compared with existing layer selection methods, demonstrating its effectiveness, efficiency, and scalability.
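The gate-based selection described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how gates are parameterized, how the two branches are combined, or the exact form of the linearization regularization, so everything below is an assumption made for illustration: a sigmoid gate per layer, a convex mix of the full-attention and linear-attention outputs, a penalty proportional to the sum of gate values, and top-k discretization under the budget. The function names (`mix_outputs`, `linearization_penalty`, `discretize_gates`) are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mix_outputs(gate_logit, full_out, linear_out):
    """Assumed gating form: convex mix of the two branch outputs.

    A gate value near 1 means the layer relies on full attention,
    a value near 0 means it relies on the linear-attention branch.
    """
    g = sigmoid(gate_logit)
    return [g * f + (1.0 - g) * l for f, l in zip(full_out, linear_out)]


def linearization_penalty(gate_logits, weight=0.1):
    """Assumed regularizer: penalize the total reliance on full attention.

    Added to the task loss, this pushes gates toward the linear branch
    unless full attention is needed for the retrieval objective.
    """
    return weight * sum(sigmoid(a) for a in gate_logits)


def discretize_gates(gate_logits, budget):
    """Keep full attention in the `budget` layers with the largest gates.

    Ties are broken in favor of the earlier layer (stable sort).
    """
    if not 0 <= budget <= len(gate_logits):
        raise ValueError("budget must be between 0 and the number of layers")
    ranked = sorted(
        range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True
    )
    keep = set(ranked[:budget])
    return ["full" if i in keep else "linear" for i in range(len(gate_logits))]


# Hypothetical learned gate logits for a 4-layer model, budget of 2 full layers.
layout = discretize_gates([2.0, -1.0, 0.5, -3.0], budget=2)
```

In this sketch only the gate logits would be trainable, which mirrors the abstract's statement that all model weights are frozen during selection. Because the gates are optimized jointly against one loss, a layer's final gate value reflects its usefulness given the other layers' gates, rather than a score computed in isolation.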