STEAM: Self-Supervised Temporal Ensemble Advantage Modeling for Real-World Robot Learning

2026-06-29 • Robotics

AI summary

The authors propose STEAM, a method that helps robots learn from real-world data in which useful progress is mixed with stalls, corrections, and mistakes. STEAM needs no extra labels: it learns to recognize progress and setbacks by predicting how far apart in time two frames of an expert demonstration are. An ensemble of such predictors scores each moment of a rollout, and the lowest score is used so that the estimate stays conservative. Combined with an existing policy-learning method (CFGRL), STEAM improves success rates over baselines on bimanual towel folding, chip checkout, cola restocking, and single-arm pick-and-place.

robot learning, expert demonstrations, self-supervised learning, temporal offset, advantage modeling, ensemble methods, policy learning, rollout data, bimanual tasks, CFGRL

Authors

Zhihao Liu, Qiuyi Gu, Yitao Wang, Dongming Qiao, Yixian Zhang, Shuaihang Chen, Liangzhi Shi, Tianxing Zhou, Zefang Huang, Kang Chen, Zhen Guo, Quanlu Zhang, Jincheng Yu, Xiaodan Liang, Guoliang Fan, Yu Wang, Feng Gao, Xinlei Chen, Chao Yu

Abstract

Real-world robot learning increasingly relies on heterogeneous data, but demonstrations and rollouts often mix useful progress with stalls, corrections, and suboptimal behavior. Effective policy learning therefore requires frame-level advantages that distinguish reliable local progress from failures and regressions. We propose Self-supervised Temporal Ensemble Advantage Modeling (STEAM), a label-free method that learns such advantages from expert demonstrations. STEAM trains an ensemble of temporal-offset predictors on frame pairs within expert trajectories, using the normalized temporal offset between two frames as a self-supervised signal. Each predictor maps a frame pair to a distribution over temporal offsets, which is converted into a scalar advantage. STEAM then takes the minimum advantage across the ensemble to score mixed-quality rollout data conservatively. Across real-world bimanual towel folding, chip checkout, cola restocking, and single-arm pick-and-place tasks, STEAM identifies stalls, failures, and recoveries. When combined with CFGRL, STEAM further improves policy success rate by 59%, 54.3%, 23% and 16.2% over baselines, respectively.
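
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the offset is normalized by trajectory length so that it lies in [-1, 1], each predictor outputs a categorical distribution over evenly spaced offset bins, and the scalar advantage is taken to be the expected offset under that distribution. The abstract does not specify any of these choices, and all function names below are hypothetical.

```python
import random


def sample_frame_pairs(num_frames, num_pairs, rng):
    """Sample (i, j, offset) training triples from one expert trajectory.

    Assumption: the self-supervised label is (j - i) / (num_frames - 1),
    so it lies in [-1, 1]; positive means frame j comes after frame i.
    """
    pairs = []
    for _ in range(num_pairs):
        i = rng.randrange(num_frames)
        j = rng.randrange(num_frames)
        pairs.append((i, j, (j - i) / (num_frames - 1)))
    return pairs


def bin_centers(num_bins):
    """Centers of evenly spaced offset bins covering [-1, 1] (assumed)."""
    width = 2.0 / num_bins
    return [-1.0 + width * (k + 0.5) for k in range(num_bins)]


def expected_offset(probs, centers):
    """Turn one predicted offset distribution into a scalar advantage.

    Assumption: the scalar is the expected offset under the distribution.
    """
    return sum(p * c for p, c in zip(probs, centers))


def conservative_advantage(ensemble_probs, centers):
    """Minimum advantage across ensemble members, as the abstract states."""
    return min(expected_offset(p, centers) for p in ensemble_probs)


# Toy usage: three hypothetical predictors disagree about one frame pair.
centers = bin_centers(4)  # [-0.75, -0.25, 0.25, 0.75]
ensemble_probs = [
    [0.0, 0.0, 0.0, 1.0],      # confident in large forward progress
    [0.0, 0.0, 1.0, 0.0],      # confident in small forward progress
    [0.25, 0.25, 0.25, 0.25],  # uncertain, so expected offset is zero
]
advantage = conservative_advantage(ensemble_probs, centers)
pairs = sample_frame_pairs(num_frames=10, num_pairs=5, rng=random.Random(0))
```

In this toy case the uncertain predictor pulls the score down to zero, which shows why taking the minimum is conservative: a frame pair is credited with progress only when every ensemble member agrees that progress occurred.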
