Video-Based Optimal Transport for Feedback-Efficient Offline Preference-Based Reinforcement Learning
2026-06-15 • Robotics
AI summary
The authors developed a method called VOTP that helps teach robots complex tasks without requiring large amounts of human feedback. Instead of hand-designing rewards or collecting many preference labels, VOTP compares behaviors in video using pretrained video models and optimal transport. This lets it generate pseudo-labels from only a few human-labeled examples and apply them to many unlabeled cases. Their tests show that VOTP outperforms other methods when feedback is limited, including in settings with visual distractors and on real robot tasks.
Reinforcement Learning, Reward Engineering, Preference-based RL, Video Foundation Models, Optimal Transport, Semi-supervised Learning, Pseudo-labeling, Robotic Manipulation, Locomotion Tasks, Human Feedback
Authors
Tung M. Luu, Hwanhee Kim, Younghwan Lee, Chang D. Yoo
Abstract
Conveying complex objectives to reinforcement learning (RL) agents often requires meticulous reward engineering. Preference-based RL (PbRL) offers a promising alternative by learning reward functions from human feedback, but its scalability is hindered by high labeling costs. Inspired by advances in Video Foundation Models (ViFMs), we present Video-based Optimal Transport Preference (VOTP), a semi-supervised framework that learns effective reward functions from only a handful of labels. By leveraging optimal transport to align visual trajectories within the rich representation space of ViFMs, VOTP effectively generates high-fidelity pseudo-labels for large amounts of unlabeled data, substantially reducing human supervision. Extensive experiments across locomotion and manipulation benchmarks demonstrate the superiority of VOTP, which outperforms state-of-the-art offline PbRL methods under limited feedback budgets. We also showcase the robustness of VOTP in the presence of visual distractors and validate its utility on real robotic tasks, where it learns meaningful rewards with minimal human input.
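To make the idea of aligning trajectories with optimal transport more concrete, here is a minimal sketch, not the authors' implementation. It assumes each video segment has already been encoded into a sequence of per-frame embeddings by some video model (the encoder is not shown), uses entropy-regularized optimal transport (Sinkhorn iterations) with a cosine cost, and applies a hypothetical labeling rule: of two unlabeled segments, the one with the lower average transport cost to a few human-preferred reference segments is pseudo-labeled as preferred. The function names, the `eps` and `margin` parameters, and the labeling rule itself are illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np


def cosine_cost(a, b):
    """Pairwise cosine distance between rows of a (n, d) and b (m, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T


def sinkhorn_cost(a, b, eps=0.05, n_iters=200):
    """Entropy-regularized OT cost between two embedding sequences.

    Both sequences get uniform marginals over their frames.
    eps and n_iters are illustrative defaults, not values from the paper.
    """
    C = cosine_cost(a, b)
    n, m = C.shape
    mu = np.full(n, 1.0 / n)
    nu = np.full(m, 1.0 / m)
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):      # alternate scaling to match both marginals
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    P = u[:, None] * K * v[None, :]   # approximate transport plan
    return float(np.sum(P * C))


def pseudo_label(seg0, seg1, preferred_refs, margin=0.0):
    """Hypothetical rule: prefer the segment closer (in OT cost) to the
    human-preferred references. Returns 0 or 1, or None when the gap is
    within `margin` and the pair is left unlabeled.
    """
    d0 = np.mean([sinkhorn_cost(seg0, r) for r in preferred_refs])
    d1 = np.mean([sinkhorn_cost(seg1, r) for r in preferred_refs])
    if abs(d0 - d1) <= margin:
        return None
    return 0 if d0 < d1 else 1
```

In this sketch, `seg0`, `seg1`, and each entry of `preferred_refs` are arrays of shape `(frames, embedding_dim)`. Returning `None` for small gaps is one simple way to keep only confident pseudo-labels; a real system would need to choose that threshold empirically.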