PowLU: An Activation Function for Stable Pre-Training of LLMs
2026-05-25 • Computation and Language
Computation and Language, Machine Learning
AI summary
The authors explain that SwiGLU, an activation function widely used in large language models, can destabilize training because for large positive inputs it behaves roughly like squaring them, which enlarges the output range and amplifies outliers, especially in low-precision arithmetic. To fix this, they propose a new activation function, PowLU, which uses a rational power function to adapt its nonlinearity and keep training stable without giving up the ability to model complex patterns. Scaling-law experiments show consistent behavior across model sizes, and experiments with the Ling architecture (7.9B and 124B total parameters) show results competitive with SwiGLU and SwiGLU-Clip. The authors also give theoretical justification for several key properties of PowLU.
activation function, SwiGLU, Power Linear Unit (PowLU), large language models, numerical stability, low-precision training, nonlinearity, model scalability, pre-training, scaling laws
Authors
Peijie Jiang, Yuqi Feng, Cunyin Peng, Qian Zhao, Jia Liu, KunLong Chen, Zhiqiang Zhang, Jun Zhou
Abstract
In contemporary large language models (LLMs), the swish-gated linear unit (SwiGLU) activation function is widely adopted to regulate information flow and introduce nonlinearity. For large positive inputs, SwiGLU approximates the quadratic function $x^2$, providing strong nonlinearity and expressive capacity. However, this property also causes numerical instability as the input or model scale increases, particularly in low-precision LLM training. The main reason is its approximately quadratic amplification, which enlarges the output range and exacerbates outliers. To address this issue, we propose a stable activation function, Power Linear Unit (PowLU), for large-scale LLM pre-training. Specifically, PowLU employs a rational power function to achieve adaptive nonlinearity, thereby improving representation ability and enabling stable training in spike regions. Moreover, we provide theoretical justification for several key properties of PowLU. Scaling law experiments confirm that the performance is consistent across model sizes, and further experiments with the Ling architecture (7.9B and 124B total parameters) demonstrate that PowLU achieves competitive results against SwiGLU and SwiGLU-Clip in large-scale LLM training. The experimental results also show that PowLU improves the scalability of large-scale LLM training.
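The quadratic behavior described in the abstract can be illustrated with a minimal scalar sketch. It assumes the standard SwiGLU form, swish(gate) * value, and feeds the same number into both branches; the learned projections are omitted, and the function names `swish` and `swiglu_scalar` are hypothetical helpers chosen for illustration, not code from the paper. The sketch covers SwiGLU only, since the abstract does not state the PowLU formula.

```python
import math

# Largest finite value in IEEE 754 half precision (fp16).
FP16_MAX = 65504.0


def swish(z: float) -> float:
    """Swish / SiLU: z * sigmoid(z), written to avoid overflow in exp."""
    if z >= 0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)


def swiglu_scalar(gate: float, value: float) -> float:
    """Scalar SwiGLU: swish(gate) * value.

    Assumption: the linear projections that normally produce `gate` and
    `value` are left out so the growth rate is easy to see.
    """
    return swish(gate) * value


if __name__ == "__main__":
    # With gate == value == x, the output approaches x^2 as x grows,
    # because sigmoid(x) tends to 1 for large positive x.
    for x in (1.0, 10.0, 100.0, 300.0):
        y = swiglu_scalar(x, x)
        print(
            f"x={x:>6}: swiglu={y:.4f}  x^2={x * x:.1f}  "
            f"ratio={y / (x * x):.4f}  exceeds fp16 max: {y > FP16_MAX}"
        )
```

In this simplified setting, an input of 300 already yields an output near 90000, beyond the largest finite fp16 value, which is one way to picture how quadratic amplification could stress low-precision training.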