Deep Q-Learning on Hölder Spaces

2026-06-15 | Machine Learning

Machine Learning, Artificial Intelligence
AI summary

The authors analyze the mathematical core of Q-learning for continuous-time stochastic control problems with continuous states and actions. They focus on the Bellman optimality update and show that, under uniform ellipticity and Hölder-regular coefficients, it smooths the state variable while retaining only Lipschitz dependence on the action variable. This mixed regularity motivates a tensor-product DeepONet architecture, for which they derive explicit approximation and resource bounds. The results advance the theory of Q-learning in continuous settings but do not include a full convergence proof for practical sampled algorithms.

Q-learning, Bellman optimality, continuous-time stochastic control, Hölder regularity, anisotropic regularity, DeepONet, uniform ellipticity, value-based reinforcement learning, diffusion processes, approximation complexity
Authors
Qian Qi
Abstract
We study the operator-theoretic core of Q-learning in continuous-time stochastic control with continuous states and actions. In value-based reinforcement learning, each Q-learning or DQN update is built from a Bellman optimality target; our analysis isolates this target in a diffusion setting and studies its regularity and approximation complexity. Under uniform ellipticity and Hölder-regular coefficients, we show that a Bellman update maps bounded inputs into an anisotropic regularity class, smoothing the state variable while leaving only Lipschitz dependence on the action variable. This yields a compact family of Bellman iterates and motivates a tensor-product DeepONet architecture adapted to the mixed regularity of the problem. We then derive explicit approximation and resource bounds, together with a stiffness-complexity trade-off as the time step $\delta \to 0$. The resulting theory makes a direct contribution to Q-learning theory at the level of Bellman target regularity and approximation in continuous stochastic control. At the same time, we do not claim a full convergence theorem for practical sampled Q-learning with exploration, replay, and stochastic gradient updates.
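To make the object of study concrete, the following is a minimal sketch of a single Bellman optimality target for a one-dimensional controlled diffusion, estimated by Monte Carlo. It is an illustration under stated assumptions, not the paper's construction: the target form, the Euler-Maruyama discretization, the finite action grid, and the names `bellman_target`, `drift`, `reward`, `beta`, and `sigma` are all hypothetical. A constant `sigma > 0` stands in for the uniform ellipticity condition.

```python
import math
import random


def bellman_target(q, x, a, delta, actions, drift, sigma, reward,
                   beta=1.0, n_samples=2000, seed=0):
    """Monte Carlo estimate of one Bellman optimality target (TQ)(x, a).

    Assumed form (an illustration, not taken from the paper):
        (TQ)(x, a) = delta * r(x, a)
                     + exp(-beta * delta) * E[ max_{a'} Q(X_delta, a') ],
    where X_delta is one Euler-Maruyama step of
        dX = drift(X, a) dt + sigma dW,  X_0 = x.
    """
    rng = random.Random(seed)
    discount = math.exp(-beta * delta)
    total = 0.0
    for _ in range(n_samples):
        # One Euler-Maruyama step; the Gaussian noise is what smooths
        # the dependence of the target on the state x.
        noise = rng.gauss(0.0, 1.0)
        x_next = x + drift(x, a) * delta + sigma * math.sqrt(delta) * noise
        # Greedy value at the next state over a finite action grid.
        total += max(q(x_next, a_next) for a_next in actions)
    return delta * reward(x, a) + discount * total / n_samples


# Toy problem (all choices hypothetical): mean-reverting drift, quadratic cost.
actions = [-1.0, 0.0, 1.0]
drift = lambda x, a: a - x
reward = lambda x, a: -(x * x) - 0.1 * a * a

# Starting from Q = 0, the target reduces to the running reward delta * r(x, a).
q_zero = lambda x, a: 0.0
target = bellman_target(q_zero, 0.5, 1.0, 0.1, actions, drift, 0.5, reward)
```

In this sketch, shrinking `delta` reduces the noise scale `sigma * sqrt(delta)` and hence the smoothing applied in a single step, which gives one informal way to read the stiffness-complexity trade-off mentioned in the abstract.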