Curvature-Weighted Gradient Diversity: A Noise Measure for Geometry-Adaptive SGD Schedules

2026-06-29 | Machine Learning

AI summary

The authors point out that standard analyses of mini-batch SGD noise use a single variance term that treats all parameter directions equally, even though noise in high-curvature directions matters less. They propose Curvature-Weighted Gradient Diversity (CWGD), which weights per-sample gradient diversity by the inverse square root of the Hessian. Using CWGD to modulate a cosine learning-rate schedule, they report roughly 20% lower final optimization error than standard cosine annealing on quadratic problems. They also discuss practical issues such as estimating curvature and limitations in non-convex optimization, including Hessian staleness.

Stochastic Gradient Descent, Mini-batch, Gradient Noise, Hessian, Curvature, Learning Rate Schedule, Cosine Annealing, Optimization Error, Quadratic Objective, Hutchinson Estimator
Authors
Muhammad Hamza, Ayush Goel
Abstract
The standard convergence analysis of mini-batch stochastic gradient descent (SGD) models gradient noise using a single variance term that treats all parameter directions equally, ignoring the fact that noise in high-curvature directions has less impact because learning rates are already constrained there. We introduce Curvature-Weighted Gradient Diversity (CWGD), a geometry-aware measure that weights per-sample gradient diversity by the inverse square root of the Hessian, providing a tighter proxy for the effective optimization noise. For strongly convex quadratic objectives with diagonal Hessians and isotropic noise, we prove that a CWGD-modulated cosine learning-rate schedule can reduce the asymptotic optimization error floor by up to a factor of two compared with standard cosine annealing. We implement this idea as CWGD-Cosine using a Hutchinson-based diagonal Hessian estimator that is exact for quadratic objectives. Across a range of condition numbers, batch sizes, and noise structures, CWGD-Cosine consistently achieves approximately 20% lower final optimization error than standard cosine annealing while incurring negligible overhead in the quadratic setting. We also identify and correct a degenerate curvature estimator, analyze the robustness of the proposed estimator, and explicitly discuss the limitations of the method, including Hessian staleness in non-convex optimization. These results establish CWGD as a principled geometry-aware measure of optimization noise and motivate future extensions to more general learning problems.
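As an illustration of two ingredients the abstract names, the following minimal Python sketch shows (a) a Hutchinson-style diagonal Hessian estimate built from Rademacher probes, which is exact when the Hessian is diagonal because each probe entry squares to one, and (b) one plausible reading of a curvature-weighted diversity ratio. The abstract does not give the CWGD formula, so the ratio below (per-sample squared norms over the squared norm of the summed gradient, both measured with weights `diag(H)^(-1/2)`) is an assumption, and the names `hutchinson_diag`, `weighted_diversity`, and `hvp` are hypothetical rather than taken from the authors' implementation.

```python
import math
import random


def hutchinson_diag(hvp, dim, num_probes=1, rng=None):
    """Estimate diag(H) as the average of z * (H z) over Rademacher probes z.

    `hvp` is a caller-supplied Hessian-vector product (hypothetical name).
    For a diagonal H, z_i * (H z)_i = h_i * z_i**2 = h_i, so one probe is exact.
    """
    rng = rng or random.Random(0)
    est = [0.0] * dim
    for _ in range(num_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hz = hvp(z)
        for i in range(dim):
            est[i] += z[i] * hz[i] / num_probes
    return est


def weighted_diversity(per_sample_grads, diag_h, eps=1e-12):
    """Assumed form: sum_i ||g_i||_W^2 / ||sum_i g_i||_W^2, W = diag(h)^(-1/2)."""
    # Clamp curvature away from zero so the inverse square root stays finite.
    w = [1.0 / math.sqrt(max(h, eps)) for h in diag_h]

    def sq_norm(g):
        return sum(wj * gj * gj for wj, gj in zip(w, g))

    total = [sum(col) for col in zip(*per_sample_grads)]
    return sum(sq_norm(g) for g in per_sample_grads) / max(sq_norm(total), eps)


# Toy diagonal quadratic with condition number 1e4.
diag_true = [100.0, 1.0, 0.01]
hvp = lambda v: [h * x for h, x in zip(diag_true, v)]
diag_est = hutchinson_diag(hvp, dim=3)

orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
identical = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
ratio_orthogonal = weighted_diversity(orthogonal, diag_est)
ratio_identical = weighted_diversity(identical, diag_est)
```

Under this assumed definition, identical per-sample gradients give the minimum ratio of 1/n, while mutually orthogonal gradients give a ratio of 1; the weighting reduces the contribution of high-curvature coordinates to both norms. How the paper maps such a quantity onto the cosine schedule is not stated in the abstract and is not sketched here.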