UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning
2026-06-10 • Robotics
Robotics, Machine Learning
AI summary
The authors present UniIntervene, a system that reduces how often humans must correct a robot during real-world reinforcement learning. Rather than relying on an operator to step in, UniIntervene detects when the robot is stuck in unproductive exploration and autonomously steers it back toward high-value states. It does this by estimating the value of the current action from its predicted consequence, monitoring that value over time for stagnation or decline, and reusing targets from past intervention episodes to recover. In real-world manipulation experiments, UniIntervene improved the average success rate by 8.6% while reducing human interventions by 57% relative to state-of-the-art HiL-RL baselines.
Human-in-the-loop reinforcement learning, Robotic manipulation, Policy improvement, Intervention model, Action-value estimation, Temporal value-risk critic, Goal-conditioned policy, Recovery actions, Unproductive exploration, Real-world reinforcement learning
Authors
Haoyuan Deng, Yitong Gao, Yudong Lin, Haichao Liu, Zhenyu Wu, Ziwei Wang
Abstract
Human-in-the-loop reinforcement learning (HiL-RL) has emerged as an effective paradigm for real-world robotic manipulation, enabling online policy improvement with human guidance. However, current HiL-RL frameworks remain intervention-intensive, relying on frequent human corrections to redirect the policy out of unproductive exploration, which incurs high labor cost and limits real-world scalability. To address this, we propose UniIntervene, an agentic intervention model that detects unproductive exploration and autonomously recovers the policy toward high-value states, taking over the bulk of interventions from human operators. Specifically, UniIntervene first performs future-conditioned action-value estimation, predicting the latent consequence of the current action and evaluating its induced value, which provides a more stable progress signal. Building on this, a temporal value-risk critic aggregates recent value dynamics and triggers intervention when the estimated value exhibits sustained stagnation or degradation. When intervention is required, UniIntervene retrieves a high-value recovery target from a memory of past intervention episodes and produces executable corrective actions through a goal-conditioned recovery policy. In this way, UniIntervene turns intervention from passive human correction into a value-aware recovery process for efficient real-world RL. Extensive experiments on diverse real-world manipulation tasks demonstrate that UniIntervene improves the average success rate by 8.6% while reducing human interventions by 57% relative to state-of-the-art HiL-RL baselines.
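The trigger-then-retrieve loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the temporal value-risk critic or the memory lookup is built, so a fixed-window gain threshold stands in for the critic and a nearest-neighbor search stands in for retrieval. All names here (`StagnationMonitor`, `MemoryEntry`, `retrieve_target`, `window`, `min_gain`, `k`) are hypothetical.

```python
import math
from collections import deque
from dataclasses import dataclass


class StagnationMonitor:
    """Hypothetical stand-in for a temporal value-risk critic.

    Flags intervention when the value estimate has gained less than
    `min_gain` across the last `window` steps (stagnation or decline).
    """

    def __init__(self, window: int = 5, min_gain: float = 0.01):
        self.window = window
        self.min_gain = min_gain
        self.values = deque(maxlen=window)  # keeps only the most recent values

    def update(self, value: float) -> bool:
        self.values.append(value)
        if len(self.values) < self.window:
            return False  # not enough history to judge a trend
        gain = self.values[-1] - self.values[0]
        return gain < self.min_gain


@dataclass
class MemoryEntry:
    state: tuple   # assumed: a low-dimensional state embedding
    value: float   # assumed: value estimate stored with that state


def retrieve_target(memory, state, k: int = 3) -> MemoryEntry:
    """Among the k stored states nearest to `state`, return the highest-value one."""
    nearest = sorted(memory, key=lambda e: math.dist(e.state, state))[:k]
    return max(nearest, key=lambda e: e.value)


if __name__ == "__main__":
    monitor = StagnationMonitor(window=3, min_gain=0.05)
    memory = [
        MemoryEntry((0.0, 0.0), 0.2),
        MemoryEntry((1.0, 0.0), 0.9),
        MemoryEntry((5.0, 5.0), 1.0),
    ]
    state = (0.1, 0.0)
    for v in [0.1, 0.3, 0.5, 0.5, 0.5]:
        if monitor.update(v):
            target = retrieve_target(memory, state, k=2)
            print("intervene, recovery target:", target.state)
```

In this sketch the retrieved target would then be handed to a goal-conditioned recovery policy, which is omitted here. Restricting the search to nearby states before maximizing value is one design choice among several; it keeps the recovery target reachable from the current state rather than simply picking the globally best stored state.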