Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing

2026-06-01 • Machine Learning

Machine Learning, Artificial Intelligence

AI summary

The authors address a problem in synchronous reinforcement learning: a single slow rollout (a straggler) forces the whole group to wait, delaying reward computation and parameter updates. They propose Straggler-Aware Group Control (SAGC), which adjusts the group size during training to limit these delays while retaining the benefits of larger groups. In their experiments, SAGC reduces straggler incidence and improves wall-clock efficiency while achieving competitive or better training reward. Models trained with SAGC are also competitive with or better than the strongest static group-size baseline on downstream reasoning benchmarks, and they often produce shorter outputs without an explicit length penalty.

reinforcement learning, synchronous training, Group Relative Policy Optimization (GRPO), stragglers, dynamic group size, online optimization, reward computation, training efficiency, model quality, on-policy RL

Authors

Azal Ahmad Khan, Ammar Ahmed, Zeshan Fayyaz, Sheng Di, Mingyi Hong, Ali Anwar

Abstract

Synchronous reinforcement learning methods such as Group Relative Policy Optimization (GRPO) provide stable and reproducible on-policy training, but they are highly vulnerable to stragglers: a single unusually long rollout can delay reward computation and parameter updates for the entire group. This problem becomes more severe as group size increases, creating a tension between the benefits of larger groups and the wall-clock cost of synchronization stalls. We propose Straggler-Aware Group Control (SAGC), a dynamic group-size controller that adapts the training group online based on observed rollout behavior. SAGC formulates group-size selection as an online constrained optimization problem, seeking to retain the benefits of larger groups while controlling the long-term rate of straggler events. Across synchronous GRPO and DAPO training, and on top of both vanilla and strong engineered baselines, SAGC consistently reduces straggler incidence and improves wall-clock efficiency while achieving competitive or better training reward. We further show that these gains transfer to final model quality: SAGC is competitive with or better than the strongest static group-size baseline on downstream reasoning benchmarks, and often produces shorter outputs without any explicit length penalty. These results position dynamic group control as a practical way to make synchronous on-policy RL more efficient and robust.
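
The abstract does not give SAGC's update rule, so the following is only an illustrative sketch of one standard way to control a long-term event rate online (a dual-ascent style controller). It is not the authors' algorithm. The class name `GroupSizeController`, the logarithmic benefit of group size, the duration `threshold` that defines a straggler, and the independence assumption behind `straggler_prob` are all assumptions made for this example.

```python
import math


class GroupSizeController:
    """Hypothetical dual-ascent group-size controller (not the paper's SAGC rule)."""

    def __init__(self, sizes, target_rate, threshold, lr=1.0):
        self.sizes = sorted(sizes)      # candidate group sizes
        self.target_rate = target_rate  # allowed long-term straggler-event rate
        self.threshold = threshold      # assumed: a rollout longer than this is a straggler
        self.lr = lr                    # step size for the dual variable
        self.lam = 0.0                  # dual variable: price of a straggler event
        self.slow = 0                   # slow rollouts observed so far
        self.seen = 0                   # total rollouts observed so far

    def straggler_prob(self, g):
        # Assumed model: rollouts are slow independently with estimated rate q,
        # so a group of size g contains at least one straggler w.p. 1 - (1 - q)^g.
        q = self.slow / max(self.seen, 1)
        return 1.0 - (1.0 - q) ** g

    def choose(self):
        # Primal step: trade an assumed log(g) benefit against the priced risk.
        return max(
            self.sizes,
            key=lambda g: math.log(g) - self.lam * self.straggler_prob(g),
        )

    def update(self, durations):
        # Record one synchronous step and return its wall-clock cost,
        # which is set by the slowest rollout in the group.
        slow = sum(d > self.threshold for d in durations)
        self.slow += slow
        self.seen += len(durations)
        event = 1.0 if slow > 0 else 0.0
        # Dual step: raise the price when events exceed the target rate,
        # lower it (never below zero) otherwise.
        self.lam = max(0.0, self.lam + self.lr * (event - self.target_rate))
        return max(durations)
```

In this sketch a training loop would call `choose()` before each step to pick the group size and `update()` afterwards with the observed rollout durations. With no stragglers observed the price stays at zero and the largest size is selected; repeated straggler events raise the price and push the choice toward smaller groups.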
