PS-PPO: Prefix-Sampling PPO for Critic-Free RLHF

2026-06-29

Machine Learning, Artificial Intelligence
AI summary

The authors aim to reduce the cost of Reinforcement Learning from Human Feedback (RLHF), a way to train language models using human feedback. They point out that current critic-free methods spread a single trajectory-level learning signal evenly over every token of a response, so each update must process the full response, which is expensive for long reasoning traces. Their method, PS-PPO, samples a cutoff point for each response from a prompt-conditioned distribution, updates the model only on the prefix up to that point, and applies an importance-weighting correction so that the gradient estimate stays unbiased. Experiments on mathematical reasoning and RLHF benchmarks show that PS-PPO reduces training compute and peak GPU memory while keeping accuracy comparable to strong critic-free baselines.

Reinforcement Learning, Human Feedback, Large Language Models, Critic-free Methods, Proximal Policy Optimization, Trajectory Truncation, Importance Weighting, Policy Update, Mathematical Reasoning Benchmark, GPU Memory Efficiency
Authors
Doo Hwan Hwang, Kee-Eung Kim
Abstract
Reinforcement Learning from Human Feedback (RLHF) for Large Language Models increasingly relies on critic-free methods as a practical alternative to actor-critic training. Despite their simplicity, existing critic-free approaches propagate a trajectory-level learning signal uniformly across all tokens in a trajectory. This requires full-trajectory policy updates for every rollout, leading to substantial optimization cost for long reasoning traces, even though intermediate prefixes often contain enough information to largely determine the final outcome. We propose Prefix-Sampling Proximal Policy Optimization (PS-PPO), a compute-efficient critic-free method for RLHF that exploits this temporal redundancy. PS-PPO introduces a prompt-conditioned cutoff distribution and samples a cutoff timestep for each trajectory. During the update pass, PS-PPO backpropagates only through the sampled prefix of each trajectory and applies an importance-weighting correction so that the resulting truncated gradient estimator remains unbiased with respect to the full-trajectory objective. Experiments on mathematical reasoning and RLHF benchmarks show that PS-PPO achieves large reductions in training compute and peak GPU memory, while maintaining accuracy comparable to strong critic-free baselines.
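
The abstract does not spell out the form of the importance-weighting correction. As a rough illustration of how a truncated estimator can stay unbiased, the sketch below uses one standard construction that may differ from the authors' estimator: each token term kept in the sampled prefix is divided by the probability that the cutoff reaches that token. The scalar `per_token_terms` stand in for per-token gradient contributions, and all function names and the fixed `cutoff_probs` vector are hypothetical (in PS-PPO the cutoff distribution is prompt-conditioned).

```python
import random


def survival_probs(cutoff_probs):
    """Return P(cutoff >= t) for each position t, given P(cutoff = t)."""
    surv, tail = [], 0.0
    for p in reversed(cutoff_probs):
        tail += p
        surv.append(tail)
    return surv[::-1]


def truncated_estimate(per_token_terms, cutoff_probs, rng):
    """Sample a cutoff, keep only the prefix, and reweight each kept term.

    Assumption: inverse-survival weighting, chosen here for illustration.
    It requires a nonzero probability of sampling the full length.
    """
    surv = survival_probs(cutoff_probs)
    positions = range(1, len(per_token_terms) + 1)
    cutoff = rng.choices(positions, weights=cutoff_probs, k=1)[0]
    return sum(per_token_terms[t] / surv[t] for t in range(cutoff))


def expected_estimate(per_token_terms, cutoff_probs):
    """Exact expectation of the truncated estimate, by enumerating cutoffs."""
    surv = survival_probs(cutoff_probs)
    total = 0.0
    for cutoff in range(1, len(per_token_terms) + 1):
        prefix = sum(per_token_terms[t] / surv[t] for t in range(cutoff))
        total += cutoff_probs[cutoff - 1] * prefix
    return total


if __name__ == "__main__":
    terms = [0.5, -1.0, 2.0, 0.25]      # toy per-token contributions
    probs = [0.1, 0.2, 0.3, 0.4]        # toy cutoff distribution
    print(sum(terms), expected_estimate(terms, probs))
    print(truncated_estimate(terms, probs, random.Random(0)))
```

Under this weighting, the expectation over cutoffs matches the full-trajectory sum, while any single update touches only the sampled prefix. Shorter cutoffs lower the cost per update but increase the weights on late tokens, so the choice of cutoff distribution trades compute against variance.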