Random Reshuffling Dominates Stochastic Gradient Descent

2026-06-30Machine Learning

Machine Learning
AI summary

The authors studied a popular method called Random Reshuffling (RR), used to improve a classic optimization algorithm named Stochastic Gradient Descent (SGD). While RR worked well in practice, existing theories said it only beats SGD under strict conditions related to step size and number of passes through data. The authors proved that RR is actually better than SGD for smooth convex problems with any reasonable step size and after any number of passes, closing a gap between theory and practice. This resolves a longstanding question about why RR performs so well.

Stochastic Gradient DescentRandom ReshufflingShuffling SGDConvex OptimizationSmooth FunctionsStep SizeConvergence RateEpochOptimization AlgorithmsTheoretical Guarantees
Authors
Zijian Liu
Abstract
Stochastic Gradient Descent ($\textsf{SGD}$) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of $\textsf{SGD}$ differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent ($\textsf{Shuffling SGD}$). A particularly popular strategy in $\textsf{Shuffling SGD}$ is Random Reshuffling ($\textsf{RR}$), which has achieved great empirical success across numerous experiments. Despite its strong performance, $\textsf{RR}$ has long been considered a heuristic due to a lack of theoretical support. Over the last decade, people have finally established provable convergence rates for $\textsf{RR}$, thus justifying its observed superiority. However, for smooth convex optimization, two clouds over the convergence theory of $\textsf{RR}$ remain to this day. More precisely, according to the current theory, $\textsf{Shuffling SGD}$ under $\textsf{RR}$ converges only when the stepsize is smaller than a threshold proportional to $1/n$, where $n$ is the number of summands in the objective (or the number of data points). Consequently, the optimally tuned theoretical rate of $\textsf{Shuffling SGD}$ under $\textsf{RR}$ is strictly worse than that of $\textsf{SGD}$ when the number of epochs is smaller than another threshold proportional to $n$. These two restrictions heavily limit the applicability of existing theories and leave a critical mismatch with practice. In this work, for the first time, we prove that $\textsf{RR}$ dominates $\textsf{SGD}$ in smooth convex optimization under any reasonable stepsize after any finite number of epochs, thereby addressing a longstanding open question.