Gaussian Approximation for Asynchronous Q-learning

2026-04-08 · Machine Learning

AI summary

The authors study an asynchronous variant of the Q-learning algorithm used in reinforcement learning, focusing on how quickly its averaged iterates approach a normal distribution as more data is collected. Assuming that the process generating the data satisfies certain stability conditions, they prove a convergence rate that depends on the number of samples, states, and actions. Along the way, they establish a new mathematical result about sums of certain dependent random variables, which helps explain the algorithm's behavior. They also provide bounds on the variability of the algorithm's final iterate.

Keywords
Q-learning, Polyak-Ruppert averaging, high-dimensional central limit theorem, Markov chain, martingale differences, asynchronous algorithms, convergence rates, reinforcement learning, stepsize schedule, moment bounds
Authors
Artemy Rubtsov, Sergey Samsonov, Vladimir Ulyanov, Alexey Naumov
Abstract
In this paper, we derive rates of convergence in the high-dimensional central limit theorem for Polyak-Ruppert averaged iterates generated by the asynchronous Q-learning algorithm with a polynomial stepsize $k^{-\omega}$, $\omega \in (1/2, 1]$. Assuming that the sequence of state-action-next-state triples $(s_k, a_k, s_{k+1})_{k \geq 0}$ forms a uniformly geometrically ergodic Markov chain, we establish a rate of order up to $n^{-1/6} \log^{4}(nSA)$ over the class of hyper-rectangles, where $n$ is the number of samples used by the algorithm and $S$ and $A$ denote the numbers of states and actions, respectively. To obtain this result, we prove a high-dimensional central limit theorem for sums of martingale differences, which may be of independent interest. Finally, we present bounds on high-order moments of the algorithm's last iterate.
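
To make the setting concrete, here is a minimal Python sketch of the averaged asynchronous Q-learning recursion the abstract describes: a single Markov-chain trajectory of $(s_k, a_k, s_{k+1})$ triples drives updates of one entry of the Q-table per step, with polynomial stepsize $k^{-\omega}$ and a running Polyak-Ruppert average of the iterates. The environment interface (`reward_fn`), the discount factor `gamma`, and the choice `omega = 0.7` are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def averaged_async_q_learning(trajectory, S, A, reward_fn,
                              gamma=0.9, omega=0.7):
    """Sketch of asynchronous Q-learning with Polyak-Ruppert averaging.

    trajectory : iterable of (s_k, a_k, s_{k+1}) triples from one Markov chain
    reward_fn  : hypothetical callable (s, a, s_next) -> reward
    """
    Q = np.zeros((S, A))       # current iterate Q_k
    Q_bar = np.zeros((S, A))   # Polyak-Ruppert average (1/k) * sum_{j<=k} Q_j
    for k, (s, a, s_next) in enumerate(trajectory, start=1):
        alpha = k ** (-omega)  # polynomial stepsize k^{-omega}, omega in (1/2, 1]
        td_target = reward_fn(s, a, s_next) + gamma * Q[s_next].max()
        # Asynchronous update: only the visited (s, a) entry changes.
        Q[s, a] += alpha * (td_target - Q[s, a])
        # Running-mean form of the Polyak-Ruppert average.
        Q_bar += (Q - Q_bar) / k
    return Q, Q_bar
```

The paper's high-dimensional CLT concerns the $SA$-dimensional vector `Q_bar` after $n$ steps: suitably rescaled, its distribution is close to Gaussian uniformly over hyper-rectangles, at a rate of up to $n^{-1/6} \log^{4}(nSA)$ under the stated ergodicity assumptions.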