Sensitivity-Positional Co-Localization in GQA Transformers
2026-04-09 • Computation and Language
Computation and Language • Artificial Intelligence • Machine Learning
AI summary
The authors studied Grouped Query Attention (GQA) transformers, using Llama 3.1 8B, to see whether the layers that matter most for task correctness are the same layers where adapting the positional encoding has the biggest effect. They expected the two sets of layers to overlap but found the opposite: task-sensitive layers cluster near the end of the network, while positional-encoding influence is strongest near the beginning. Even so, applying both adaptations to the task-sensitive layers beat every alternative configuration across six benchmarks. On HumanEval+, the adapted model came close to Claude 3.5 Haiku (67.1% vs. 68.3%), suggesting that targeted changes can be compute-efficient.
Grouped Query Attention (GQA) • transformers • positional encoding • LoRA adaptation • RoPE (Rotary Position Embedding) • Spearman correlation • layer sensitivity • task performance • Llama 3.1 • model ablation
Authors
Manoj Chandrashekar Rao
Abstract
We investigate a fundamental structural question in Grouped Query Attention (GQA) transformers: do the layers most sensitive to task correctness coincide with the layers where positional encoding adaptation has the greatest leverage? We term this the co-localization hypothesis and test it on Llama 3.1 8B, a 32-layer GQA model with a 4:1 query-to-key-value head ratio. We introduce LS-LoRA, which restricts LoRA adaptation to layers identified via a novel correctness-differential hidden-state metric, and GARFA (GQA-Aware RoPE Frequency Adaptation), which attaches 8 learnable per-KV-head scalar multipliers to each targeted layer. Contrary to the co-localization hypothesis, we discover strong anti-localization: task-sensitive layers concentrate in the late network ($\ell \in \{23, \dots, 31\}$) while RoPE-influential layers dominate the early network ($\ell \in \{0, \dots, 9\}$), yielding Spearman $r_s = -0.735$ ($p = 1.66\times10^{-6}$). Despite this anti-localization, a 4-way cross-layer ablation shows that applying both interventions to the sensitivity-identified layers outperforms all alternative configurations by 4 to 16 percentage points across six diverse benchmarks (MMLU, GPQA, HumanEval+, MATH, MGSM, ARC), approaching Claude 3.5 Haiku on HumanEval+ (67.1% vs. 68.3%) at \$100 total compute cost.
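The co-localization test described in the abstract can be illustrated with a minimal Python sketch. It assumes one scalar score per layer for task sensitivity and one for RoPE influence, computes the Spearman rank correlation between the two profiles, and checks whether the top-ranked layers overlap. The per-layer scores are synthetic placeholders chosen to be perfectly anti-correlated, not the paper's measurements, so the sketch does not reproduce the reported $r_s = -0.735$. The helper names (`ranks`, `spearman`, `top_k_layers`) are hypothetical and are not taken from the authors' code.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def top_k_layers(scores, k):
    """Indices of the k highest-scoring layers, in ascending layer order."""
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(best)


NUM_LAYERS = 32  # Llama 3.1 8B depth, as stated in the abstract

# ASSUMPTION: synthetic profiles for illustration only.
# Sensitivity rises with depth; RoPE influence falls with depth.
task_sensitivity = [layer / (NUM_LAYERS - 1) for layer in range(NUM_LAYERS)]
rope_influence = [1.0 - s for s in task_sensitivity]

r_s = spearman(task_sensitivity, rope_influence)
sensitive_layers = top_k_layers(task_sensitivity, 9)   # late layers
rope_layers = top_k_layers(rope_influence, 10)         # early layers
overlap = set(sensitive_layers) & set(rope_layers)

print(f"Spearman r_s = {r_s:.3f}")
print(f"task-sensitive layers: {sensitive_layers}")
print(f"RoPE-influential layers: {rope_layers}")
print(f"overlap: {sorted(overlap)}")
```

Under co-localization the two top-k sets would largely coincide and $r_s$ would be positive; a negative $r_s$ with an empty overlap is the anti-localization pattern the abstract reports. With real per-layer scores, a library routine such as `scipy.stats.spearmanr` would also supply the p-value.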