Convergence of Gradient Descent for General Neural Network Architectures Beyond the NTK Regime

2026-06-22

Machine Learning
AI summary

The authors study how gradient descent (GD) behaves across a broad family of neural network architectures, including pre-normalized multi-layer transformers, beyond the neural tangent kernel (NTK) regime. They prove that, under mild assumptions, GD with regular learning rates converges to a neighbourhood of a stationary point for almost all initializations. The proof combines an iterate-dependent PL-type inequality with Lipschitz smoothness along the GD trajectory. Interpreted under Xavier initialization, the result shows that the learning rate scale depends on depth and effective bottleneck dimensions rather than the largest width. They also derive structural nondegeneracy implications for residual connections and function composition, and give a generic characterization of global minimizers within the framework.

gradient descent, neural tangent kernel, transformers, convergence, PL inequality, Lipschitz smoothness, Xavier initialization, residual connections, stationary points, gradient trajectory
Authors
Yuqing Wang
Abstract
Training dynamics is central to understanding neural networks, yet its theoretical analysis remains difficult even for simple architectures and becomes substantially more challenging for general modern architectures. In this paper, we propose a convergence framework for analyzing gradient descent (GD) dynamics under a broad family of neural network architectures and datasets beyond the neural tangent kernel (NTK) regime. The framework is formulated at the level of network blocks and covers architectures including pre-normalized multi-layer transformers. More precisely, under mild assumptions, we prove that for almost all initializations, GD with regular learning rates converges to the neighbourhood of a stationary point. This is mainly proved by establishing an iterate-dependent PL-type inequality through analyticity and measure-zero arguments, and by proving Lipschitz smoothness along the GD trajectory through polynomial generalized smoothness and a local relaxed dissipative condition. We further interpret the theorem under Xavier initialization and practical architectural scaling, showing that the learning rate scale depends on the depth and effective bottleneck dimensions rather than the largest width. Finally, we derive structural nondegeneracy implications for residual connections and function composition, and provide a generic characterization of global minimizers within our framework.
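
For orientation, the textbook argument that a PL inequality plus Lipschitz smoothness yields GD convergence can be sketched as follows. This is the classical global version, not the paper's statement: the constants $\mu$, $L$, the step size $\eta$, and the infimum $f^\ast$ are assumed here for illustration, whereas the abstract describes an iterate-dependent PL-type inequality and smoothness that holds only along the GD trajectory.

```latex
% Assumed setting (classical, for illustration only):
%   f is L-smooth, satisfies the PL inequality with constant \mu > 0,
%   and GD uses a fixed step size 0 < \eta \le 1/L.
\begin{align*}
  \theta_{k+1} &= \theta_k - \eta \nabla f(\theta_k)
    && \text{(GD update)} \\
  f(\theta_{k+1}) &\le f(\theta_k) - \eta\Bigl(1 - \tfrac{L\eta}{2}\Bigr)\,\|\nabla f(\theta_k)\|^2
    \le f(\theta_k) - \tfrac{\eta}{2}\,\|\nabla f(\theta_k)\|^2
    && \text{($L$-smoothness, } \eta \le 1/L\text{)} \\
  \tfrac{1}{2}\,\|\nabla f(\theta_k)\|^2 &\ge \mu\,\bigl(f(\theta_k) - f^\ast\bigr)
    && \text{(PL inequality)} \\
  f(\theta_{k+1}) - f^\ast &\le (1 - \eta\mu)\,\bigl(f(\theta_k) - f^\ast\bigr)
    && \text{(combine the two lines above)}
\end{align*}
```

In this classical form the contraction factor $1 - \eta\mu$ is uniform in $k$. When the PL constant depends on the iterate, as the abstract indicates, a uniform rate of this kind is no longer automatic.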