V2P-Manip: Learning Dexterous Manipulation from Monocular Human Videos

2026-06-15 | Robotics

Robotics, Computer Vision and Pattern Recognition
AI summary

The authors present V2P-Manip, a framework that teaches robots dexterous manipulation by watching ordinary single-camera videos of people performing tasks. The method converts these videos into 3D hand trajectories that are both visually accurate and physically plausible, using a two-stage refinement that enforces spatial alignment and physical consistency. On the TACO and OakInk benchmarks it outperforms previous methods in pose accuracy, adaptability to unstructured environments, and training efficiency, and it reaches an average success rate of over 75% across multiple synthetic manipulation tasks. The extracted manipulation priors also transfer across different dexterous robot hands, offering a scalable supplement to costly teleoperation data.

dexterous manipulation, teleoperation, monocular video, 3D trajectory estimation, policy learning, spatial alignment, physical consistency, TACO benchmark, OakInk benchmark, embodied AI
Authors
Kaihan Chen, Yanming Shao, Haifeng Ji, Xiaokang Yang, Yao Mu
Abstract
Achieving autonomous robotic dexterous manipulation requires precise, human-like action sequences at scale. As a scalable supplement to costly teleoperation data, extracting trajectories with both visual fidelity and physical plausibility from monocular videos represents a promising frontier in embodied AI. To this end, we introduce V2P-Manip, a framework designed to learn dexterous manipulation policies directly from human demonstration videos. We establish an efficient, integrated pipeline encompassing 3D asset acquisition, trajectory estimation, and dexterous policy learning. To bridge the gap between visual perception and physical constraints, we introduce a two-stage refinement process that enforces spatial alignment and physical consistency. Evaluations on the TACO and OakInk benchmarks demonstrate that our approach significantly outperforms previous methods in pose accuracy, adaptability to unstructured environments, and training efficiency. Experimental results further confirm an average success rate of over 75% across multiple synthetic manipulation tasks and validate the adaptability of the extracted manipulation priors across diverse dexterous hand embodiments.