Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with $k$-step Policy Gradients

2026-05-11 · Machine Learning

AI summary

The authors revisit common policy gradient methods used in reinforcement learning, which often get stuck in poor solutions because they only look one step ahead. They propose a method that considers multiple steps at once (k-step) to avoid this short-sightedness. The approach is proven to reliably find policies whose performance is very close to the best possible one, and it applies even in complicated settings like state aggregation and multi-agent systems. The method also avoids the usual technical issues related to distribution mismatch, helping it find better policies without requiring perfect exploration.

Policy Gradient · Markov Decision Process (MDP) · k-step returns · Local Optima · Projected Gradient Descent · Mirror Descent · State Aggregation · Partially Observable Settings · Distribution Mismatch · Reinforcement Learning
Authors
Alex DeWeese, Guannan Qu
Abstract
This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause of this phenomenon: the policy gradient is itself fundamentally myopic, i.e., it improves the policy based only on the one-step $Q$-function. We propose a generalized $k$-step policy gradient method that couples the randomness within a $k$-step time window and can escape the myopic local optima in MDPs with restricted policy classes. We show this new method is theoretically guaranteed to converge to a solution whose performance is exponentially close in $k$ to that of the optimal deterministic policy. Further, we show that projected gradient descent and mirror descent with this $k$-step policy gradient achieve this exponential guarantee at a rate of $O(\frac{1}{T})$, assuming only smoothness and differentiability of the value function. This provides near-optimal solutions for previously elusive applications such as state aggregation and partially observable cooperative multi-agent settings. Moreover, our bounds avoid the ubiquitous distribution mismatch factors $\|d_\mu^{\pi^*} / d_\mu^{\pi}\|_\infty$ and $\|d_\mu^{\pi^*} / \mu\|_\infty$, enabling the $k$-step policy gradient method to escape suboptimal critical points that arise from poor exploration in fully observable settings.
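Background sketch

For orientation, here is a minimal sketch in standard notation; the symbols $J$, $d_\mu^{\pi}$, $Q^{\pi}$, $V^{\pi}$, $\gamma$, and $r_t$ denote the usual objective, discounted state visitation distribution, action-value function, state-value function, discount factor, and rewards, and the $k$-step expression below is an illustrative analogue rather than the paper's exact estimator. The classical policy gradient theorem updates the policy through the one-step $Q$-function,

$$\nabla_\theta J(\theta) \;\propto\; \mathbb{E}_{s \sim d_\mu^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big],$$

which is the source of the myopia described above: each update only asks how a single action choice changes the value. A $k$-step variant instead couples the score functions of the next $k$ decisions. Writing $J_k$ for a $k$-step lookahead objective (hypothetical notation, not necessarily the paper's), one such estimator is, up to the $\theta$-dependence of the tail value term,

$$\nabla_\theta J_k(\theta) \;\approx\; \mathbb{E}\left[\Big(\sum_{t=0}^{k-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t=0}^{k-1} \gamma^t r_t + \gamma^k V^{\pi_\theta}(s_k)\Big)\right],$$

so the update direction reflects the joint effect of a $k$-step window of actions rather than a single step.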