Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation

2026-06-15, Machine Learning

Machine Learning, Computation and Language
AI summary

The authors study how to make Transformers faster on long text by converting existing models into hybrid models that use linear attention. They find that simply copying the original model's attention weights into the new layers works poorly, because the copy leaves the new layers' recurrent dynamics (decay, write, and output gating) unspecified. Their method, Taylor-Calibrate, sets these components from statistics of the original model's attention and then briefly aligns each converted layer to the original layer's output. The converted models start off much stronger before any training and reach the same recovery targets with 4.9x to 9.2x fewer training tokens than naive conversion.

Transformer, attention, linear attention, hybrid model, Gated DeltaNet (GDN), model conversion, initialization, Taylor expansion, distillation, long-context inference
Authors
Zhongzhu Zhou, Qingyang Wu, Junxiong Wang, Mayank Mishra, Shuaiwen Leon Song, Ben Athiwaratkun, Chenfeng Xu
Abstract
Hybrid linear attention models offer an appealing path to faster long-context inference: they reduce the quadratic cost and KV-cache burden of full softmax attention while retaining much of the quality of Transformer models. A practical way to obtain such models is to convert a pretrained Transformer instead of pretraining a new architecture from scratch, but this conversion is still brittle. Simply copying the teacher attention projections into a Gated DeltaNet (GDN) student does not specify the new recurrent decay, write, and output-gating dynamics. As a result, the converted model often starts in a poor dynamical regime and must spend many distillation tokens repairing initialization rather than learning the remaining teacher behavior. We propose Taylor-Calibrate, a lightweight initialization method for hybrid GDN students. The method uses Taylor-guided teacher attention statistics to set the value projection, memory timescale, write gates, and output gate, then applies a short per-layer alignment step to match each converted layer to the teacher output. Across four teacher settings and three retained-layer policies, Taylor-Calibrate gives substantially stronger zero-shot students, with up to an 88x improvement in a representative ablation, and reaches matched recovery targets with 4.9x to 9.2x fewer training tokens than naive conversion.
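
To make concrete which quantities the copied teacher projections leave unspecified, the following is a minimal single-head sketch of a gated delta-rule recurrence of the kind used in Gated DeltaNet layers. It is not the Taylor-Calibrate procedure, which the abstract does not spell out. The update form, the scalar output gate, and the names `gated_delta_step` and `memory_timescale` are illustrative assumptions rather than details taken from the paper. The teacher supplies projections that produce `q`, `k`, and `v`, but nothing that fixes the decay `alpha`, the write strength `beta`, or the output `gate`.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta, gate=1.0):
    """One recurrent step of a gated delta rule (single head, illustrative).

    S     : (d_v, d_k) memory state
    q, k  : (d_k,) query and key;  v : (d_v,) value
    alpha : decay in (0, 1], controls how fast old memory fades
    beta  : write strength in [0, 1]
    gate  : scalar output gate (a simplification assumed here)
    """
    k = k / (np.linalg.norm(k) + 1e-6)           # unit-norm key (assumed convention)
    S = alpha * (S - beta * np.outer(S @ k, k))  # decay, then erase old content along k
    S = S + beta * np.outer(v, k)                # write the new key-value association
    return S, gate * (S @ q)                     # read with the query, then gate

def memory_timescale(alpha):
    """Number of steps for the state to shrink by a factor e under constant decay."""
    return -1.0 / np.log(alpha)
```

Under this sketch, a decay of `alpha = exp(-1/100)` corresponds to a memory timescale of about 100 tokens, which illustrates why a poorly chosen decay can leave a converted layer forgetting either far too quickly or far too slowly relative to the teacher's attention span.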