Variable-Width Transformers

2026-06-16 · Computation and Language
AI summary

The authors study how varying the width of individual layers in transformer language models affects performance. Instead of giving every layer the same width, they make early and late layers wider and middle layers narrower, creating a > <-shaped pattern. Compared with parameter-matched uniform-width baselines, this design achieves lower language modeling loss across dense and MoE models of several sizes, while requiring fewer FLOPs and less KV cache memory. They also find that the narrow middle layers yield qualitatively different representations in the residual stream. Overall, the authors show that giving different layers different widths can make language models more resource-efficient.

transformer, language model, model size scaling, layer width, model depth, residual connections, decoder-only model, FLOPs, Mixture of Experts (MoE), KV cache
Authors
Zhaofeng Wu, Oliver Sieberling, Shawn Tan, Rameswar Panda, Yury Polyanskiy, Yoon Kim
Abstract
Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped > <former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our > <former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.
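
To make the idea of a nonuniform width profile concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the width schedule or the resizing rule, so both are assumptions here: `bowtie_widths` is a hypothetical linear schedule that narrows toward the middle layer, and `resize_residual` is one possible parameter-free resizing (truncate when narrowing, zero-pad when widening).

```python
def bowtie_widths(num_layers, outer_width, middle_width):
    """Hypothetical schedule (assumption, not from the paper): width falls
    linearly from outer_width at the first layer to middle_width at the
    center, then rises symmetrically back to outer_width."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if num_layers == 1:
        return [outer_width]
    half = (num_layers - 1) / 2
    widths = []
    for i in range(num_layers):
        t = abs(i - half) / half  # 1.0 at both ends, 0.0 at the center
        widths.append(round(middle_width + t * (outer_width - middle_width)))
    return widths


def resize_residual(x, new_width):
    """One possible parameter-free resize of a residual vector (assumption):
    drop trailing coordinates when narrowing, append zeros when widening."""
    if new_width <= len(x):
        return x[:new_width]
    return x + [0.0] * (new_width - len(x))


if __name__ == "__main__":
    widths = bowtie_widths(num_layers=5, outer_width=1024, middle_width=512)
    print(widths)
    print(sum(widths) / len(widths))  # average layer width
    print(resize_residual([1.0, 2.0, 3.0], 2))
    print(resize_residual([1.0, 2.0], 4))
```

Under this toy schedule the average layer width is below the outer width, which is the property the abstract connects to lower FLOPs and KV cache cost. The numbers in the sketch are illustrative only and do not reproduce the reported 22% or 15% reductions.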