Freeform Preference Learning for Robotic Manipulation

2026-06-30 • Robotics

Robotics, Artificial Intelligence, Machine Learning

AI summary

The authors present a way for robots to learn from people who describe what they care about in plain words, such as speed or safety, instead of only picking which of two robot attempts is better overall. Their method, Freeform Preference Learning (FPL), combines these descriptions with pairwise comparisons to learn a reward model that scores each quality separately. On long-horizon manipulation tasks, this improves robot policies over sparse-reward and binary-preference methods by 38 percentage points and lets users steer the robot’s behavior at test time without retraining. The authors evaluated FPL on four real-world and two simulated tasks.

reward design, robot policy, long-horizon manipulation, sparse rewards, preference learning, language-conditioned reward model, pairwise preferences, reward-conditioned policy, robot autonomy, behavior compositionality

Authors

Marcel Torne, Anubha Mahajan, Abhijnya Bhat, Chelsea Finn

Abstract

Reward design remains a central bottleneck for autonomous robot policy improvement, especially in long-horizon manipulation tasks where sparse success labels provide too little signal and binary preferences collapse many competing notions of quality into one ambiguous signal. We introduce Freeform Preference Learning (FPL), a method for learning robot policies from freeform human preferences. Rather than asking annotators which of two trajectories is better overall, FPL lets them define natural-language preference axes, such as speed, safety, quality of placement, or carefulness, and provide pairwise preferences along each axis. These annotations are used to learn a language-conditioned reward model that maps a trajectory and preference label to an axis-specific reward. We use this model to train a reward-conditioned policy that optimizes across the multiple human-specified dimensions. Across four real-world and two simulated long-horizon manipulation tasks, FPL improves over sparse-reward and binary-preference methods by 38 percentage points. Beyond improved performance, FPL learns dense progress signals without explicit subtask segmentation, shows compositionality of behavior not present in the data, and allows users to steer the policy towards different behaviors at test time without retraining. Blog post with videos available at https://freeform-pl.github.io/fpl.website/
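The idea of learning an axis-specific reward from pairwise preferences can be illustrated with a toy sketch. The code below is not the authors' implementation: it assumes a Bradley-Terry preference loss, a linear reward over two made-up trajectory features (normalized duration and peak force), and a plain dictionary lookup on the axis name standing in for a language encoder. All names (`reward`, `train`, `steer`, the feature values) are hypothetical. The final `steer` function ranks candidate trajectories by a user-weighted mix of axis rewards, which only loosely mirrors test-time steering; the paper itself trains a reward-conditioned policy.

```python
import math

# Hypothetical trajectory features: [normalized duration, normalized peak force].
FAST_ROUGH = [0.2, 0.9]
MEDIUM = [0.5, 0.5]
SLOW_GENTLE = [0.9, 0.1]

# Pairwise preferences as (axis, preferred trajectory, other trajectory).
PREFS = [
    ("speed", FAST_ROUGH, SLOW_GENTLE),
    ("speed", FAST_ROUGH, MEDIUM),
    ("speed", MEDIUM, SLOW_GENTLE),
    ("safety", SLOW_GENTLE, FAST_ROUGH),
    ("safety", SLOW_GENTLE, MEDIUM),
    ("safety", MEDIUM, FAST_ROUGH),
]


def reward(weights, axis, features):
    """Axis-specific linear reward: dot(weights[axis], features)."""
    return sum(w * f for w, f in zip(weights[axis], features))


def bt_loss(weights, prefs):
    """Mean Bradley-Terry negative log-likelihood of the preferences."""
    total = 0.0
    for axis, win, lose in prefs:
        margin = reward(weights, axis, win) - reward(weights, axis, lose)
        total += math.log1p(math.exp(-margin))  # -log(sigmoid(margin))
    return total / len(prefs)


def train(prefs, dim, lr=0.5, steps=200):
    """Fit one weight vector per axis by stochastic gradient descent."""
    weights = {axis: [0.0] * dim for axis, _, _ in prefs}
    for _ in range(steps):
        for axis, win, lose in prefs:
            margin = reward(weights, axis, win) - reward(weights, axis, lose)
            # Gradient of -log(sigmoid(margin)) w.r.t. the weights is
            # -(1 - sigmoid(margin)) * (win - lose); step against it.
            g = 1.0 / (1.0 + math.exp(margin))  # equals 1 - sigmoid(margin)
            for i in range(dim):
                weights[axis][i] += lr * g * (win[i] - lose[i])
    return weights


def steer(weights, mix, candidates):
    """Pick the candidate with the highest user-weighted sum of axis rewards."""
    def combined(features):
        return sum(m * reward(weights, axis, features) for axis, m in mix.items())
    return max(candidates, key=combined)


weights = train(PREFS, dim=2)
candidates = [FAST_ROUGH, MEDIUM, SLOW_GENTLE]
prefer_speed = steer(weights, {"speed": 1.0, "safety": 0.0}, candidates)
prefer_safety = steer(weights, {"speed": 0.0, "safety": 1.0}, candidates)
```

In this toy setup, changing the `mix` weights changes which trajectory ranks highest without refitting `weights`, which is the intuition behind steering along separate preference axes. Because the toy features are correlated, the learned weights should not be read as a clean separation of speed from force.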
