Group-Graph Policy Optimization for Long-Horizon Agentic Reinforcement Learning

2026-06-22 | Machine Learning

Machine Learning, Artificial Intelligence, Computation and Language
AI summary

The authors address a problem in teaching AI agents to complete tasks that take many steps, where feedback often arrives only after dozens of interactions. Existing methods look at each run or step separately, missing connections between identical situations reached in different runs. Their new approach, G2PO, merges identical states from different runs into a state-transition graph, helping the AI identify which actions actually drive progress over time. Pooling value estimates across runs reduces sampling variance and helps the AI learn more effectively on long tasks. They tested G2PO on WebShop, ALFWorld, and AppWorld and report that it outperforms prompt-based and RL baselines, with success rate improvements of up to 22.2% over GRPO.

Reinforcement Learning, Large Language Models, Credit Assignment, State-Transition Graph, Temporal Difference Learning, Advantage Estimation, Long-Horizon Tasks, Sampling Variance, Policy Optimization, Multi-turn Agentic Tasks
Authors
Yunan Wang, Minghui Song, Zihan Zhang, Shaohan Huang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang
Abstract
Group-based Reinforcement Learning (RL) has significantly enhanced Large Language Models (LLMs) in agentic scenarios. To achieve finer-grained policy updates, recent agentic RL frameworks have shifted from trajectory-level to step-level training. However, long-horizon agentic RL suffers from severe reward sparsity and delay, as feedback is often deferred for dozens of interaction steps. While existing step-level frameworks refine training granularity, their credit assignment remains coarse-grained and still treats agent exploration as isolated, linear trajectories. This oversimplified perspective ignores the inherent graph structure of state transitions, leading to high-variance state-value estimation and myopic, localized credit assignment. To overcome these critical bottlenecks, we propose Group-Graph Policy Optimization (G2PO), a novel group-based RL algorithm tailored for multi-turn agentic tasks. G2PO explicitly transforms linear interaction trajectories into a global state-transition graph. By aggregating identical observations across different trajectories, we introduce group-aggregation state-value estimation that reduces sampling variance and trajectory-dependent bias. Furthermore, we redefine agent actions as transitions between state nodes and propose an edge-centric advantage estimation strategy. By globally standardizing Temporal Difference (TD) errors across the entire graph, G2PO explicitly identifies and prioritizes critical transitions that drive absolute task progress. Extensive experiments on representative long-horizon benchmarks (WebShop, ALFWorld, and AppWorld) demonstrate that G2PO substantially outperforms state-of-the-art prompt-based and RL baselines, achieving remarkable success rate improvements of up to 22.2% over GRPO.
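The abstract gives no formulas, so the following is only an illustrative sketch of the three ideas it names: merging identical observations into shared graph nodes, estimating each node's value from all trajectories that visit it, and standardizing TD errors over every edge. The concrete choices here are assumptions, not the authors' method: node values are the mean discounted return-to-go over all visits, terminal states have value 0, the discount `gamma` defaults to 1, and the function names and trajectory layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Assumed layout: a trajectory is a list of steps
# (observation, action, reward, next_observation);
# next_observation is None when the episode ends.

def state_values(trajectories, gamma=1.0):
    """Assumed aggregation: mean return-to-go over every visit to an observation."""
    returns = defaultdict(list)
    for traj in trajectories:
        g = 0.0
        # Walk backwards so g accumulates the discounted return from each step.
        for obs, _action, reward, _next_obs in reversed(traj):
            g = reward + gamma * g
            returns[obs].append(g)
    return {obs: mean(gs) for obs, gs in returns.items()}

def edge_advantages(trajectories, gamma=1.0, eps=1e-8):
    """TD error per transition (edge), standardized over all edges at once."""
    values = state_values(trajectories, gamma)
    td = {}
    for i, traj in enumerate(trajectories):
        for t, (obs, _action, reward, next_obs) in enumerate(traj):
            # Terminal or never-expanded successors are assumed to have value 0.
            v_next = 0.0 if next_obs is None else values.get(next_obs, 0.0)
            td[(i, t)] = reward + gamma * v_next - values[obs]
    errors = list(td.values())
    mu, sigma = mean(errors), pstdev(errors)
    return {key: (delta - mu) / (sigma + eps) for key, delta in td.items()}

# Two toy rollouts that share the start observation "s0":
# one succeeds (reward 1 at the end), one fails (reward 0).
trajectories = [
    [("s0", "a1", 0.0, "s1"), ("s1", "a2", 1.0, None)],
    [("s0", "b1", 0.0, "s2"), ("s2", "b2", 0.0, None)],
]
values = state_values(trajectories)
advantages = edge_advantages(trajectories)
```

In this toy case the shared node "s0" receives the pooled value 0.5, so the edge leaving "s0" toward the successful branch gets a positive advantage and the edge toward the failing branch gets a negative one, while the later edges, whose outcomes are already reflected in their node values, sit at roughly zero. Under these assumptions, credit concentrates on the transition where the two rollouts diverge.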