DexPIE: Stable Dexterous Policy Improvement from Real-World Experience

2026-06-08 • Robotics

RoboticsComputer Vision and Pattern Recognition

AI summaryⓘ

The authors address the difficulty of teaching robotic hands to perform complex tasks by imitation alone, which often needs lots of expert examples and still makes mistakes. They create DexPIE, a method that improves the robot's skill after initial training by learning from its own real-world experiences. Their approach uses a special way to gather better feedback during practice and aligns new experiences more closely with the demonstrations to help evaluate and improve the robot's choices. Testing their method on three hard tasks, they show it works noticeably better than just copying demonstrations. The authors also plan to share their code and data for others to use.

dexterous manipulationimitation learningDAggerpolicy improvementasynchronous inferencevalue functionreinforcement learningrobotic handscontact-rich dynamics

Authors

Ruizhe Liao, Wenrui Chen, Liangji Zeng, Haoran Lin, Fan Yang, Kailun Yang, Yaonan Wang

Abstract

Dexterous manipulation presents substantial challenges for imitation learning due to its high-dimensional action space and complex contact-rich dynamics. Policies trained purely from demonstrations often suffer from compounding errors during deployment and require large amounts of expert data to achieve reliable performance. To move beyond the limitations of demonstration data, in this work, we propose DexPIE, a post-training framework for dexterous policy improvement from experience collected through real-world deployment. First, DexPIE enables effective exploration coverage through a dexterous-hand-adapted intervention system and multi-stage DAgger-style data collection across initial and intermediate task stages, providing reliable supervision for accurate policy evaluation. To reduce temporal noise between post-training rollouts and demonstration data, we introduce asynchronous inference in the relative action space, which better aligns rollout data with demonstrated behavior and allows the critic to learn a value function induced by a more consistent underlying policy. Finally, DexPIE improves the policy through conditioning on a continuous optimality indicator, allowing the policy to leverage the quality of data in a more fine-grained manner. Across three challenging real-world dexterous manipulation tasks, DexPIE achieves a 37% improvement in success rate over the demonstration-based reference policy, outperforming all baseline methods and demonstrating stronger robustness. The source code and dataset will be made publicly available.

View PDFOpen arXiv