HandX: Scaling Bimanual Motion and Interaction Generation
2026-03-30 • Computer Vision and Pattern Recognition
AI summary
The authors address the challenge of creating realistic hand and two-hand (bimanual) motions in human movement simulations. They created HandX, a new combined resource that includes cleaned data, a special new dataset focused on detailed finger movements, and a new way to label these motions using AI language models. They used this data to test different AI methods for generating hand motions and introduced new ways to measure motion quality. Their results show that bigger AI models trained on better data produce more natural two-hand movements. The dataset is made available for others to use.
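The summary describes a decoupled labeling pipeline: geometric motion features (contact events, finger flexion) are extracted first, and a language model then turns them into descriptions. A minimal sketch of what such feature extraction might look like, assuming per-frame fingertip and joint positions are available; all function names, thresholds, and the prompt format below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def contact_events(left_tips, right_tips, thresh=0.01):
    """Flag frames where any left fingertip comes within `thresh` meters
    of any right fingertip (a crude inter-hand contact proxy).
    left_tips, right_tips: (T, 5, 3) fingertip positions per frame."""
    # Pairwise distances between every left/right fingertip pair, per frame
    d = np.linalg.norm(left_tips[:, :, None, :] - right_tips[:, None, :, :], axis=-1)
    return d.min(axis=(1, 2)) < thresh  # (T,) boolean contact mask

def flexion_angle(mcp, pip, tip):
    """Finger flexion as the bend angle (radians) at the middle joint,
    from three joint positions along one finger: 0 = straight."""
    u, v = mcp - pip, tip - pip
    cos = (u * v).sum(-1) / (np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1) + 1e-8)
    return np.pi - np.arccos(np.clip(cos, -1.0, 1.0))

def annotation_prompt(contact_mask, flexion, fps=30):
    """Summarize extracted features into a text prompt for an LLM:
    geometry is computed here, semantics are left to the model."""
    changes = np.flatnonzero(np.diff(contact_mask.astype(int)) != 0).tolist()
    return (f"Contact state changes at frames {changes} ({fps} fps); "
            f"mean index-finger flexion {np.degrees(flexion).mean():.0f} deg. "
            "Describe the bimanual interaction in one sentence.")

# Toy usage on synthetic motion: two hands held almost touching
rng = np.random.default_rng(0)
left = rng.normal(size=(60, 5, 3)) * 0.05
right = left + 0.005
mask = contact_events(left, right)
prompt = annotation_prompt(mask, flexion_angle(left[:, 0], left[:, 1], left[:, 2]))
```

The design point is the decoupling itself: cheap, deterministic geometry runs over the whole corpus, and only the compact feature summary is sent to the language model, which makes the annotation scalable.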
Human motion synthesis · Bimanual interaction · Finger articulation · Motion capture · Large language models · Diffusion models · Autoregressive models · Motion annotation · Hand motion metrics · Data scaling
Authors
Zimu Zhang, Yucheng Zhang, Xiyan Xu, Ziyin Wang, Sirui Xu, Kai Zhou, Bing Zhou, Chuan Guo, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui
Abstract
Synthesizing human motion has advanced rapidly, yet realistic hand motion and bimanual interaction remain underexplored. Whole-body models often miss the fine-grained cues that drive dexterous behavior (finger articulation, contact timing, and inter-hand coordination), and existing resources lack high-fidelity bimanual sequences that capture nuanced finger dynamics and collaboration. To fill this gap, we present HandX, a unified foundation spanning data, annotation, and evaluation. We consolidate and filter existing datasets for quality, and collect a new motion-capture dataset targeting underrepresented bimanual interactions with detailed finger dynamics. For scalable annotation, we introduce a decoupled strategy that extracts representative motion features, e.g., contact events and finger flexion, and then leverages reasoning from large language models to produce fine-grained, semantically rich descriptions aligned with these features. Building on the resulting data and annotations, we benchmark diffusion and autoregressive models with versatile conditioning modes. Experiments demonstrate high-quality dexterous motion generation, supported by our newly proposed hand-focused metrics. We further observe clear scaling trends: larger models trained on larger, higher-quality datasets produce more semantically coherent bimanual motion. Our dataset is released to support future research.
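The abstract mentions newly proposed hand-focused metrics without detailing them. As an illustration of the kind of measure that targets contact timing rather than average pose error, here is a hypothetical contact-timing F1 score between generated and reference motion; this is a sketch of the general idea, not the paper's actual metric:

```python
import numpy as np

def contact_mask(dist, thresh=0.01):
    """Per-frame inter-hand contact from a minimum fingertip-distance trace."""
    return dist < thresh

def contact_f1(gen_dist, ref_dist, thresh=0.01):
    """F1 between generated and reference contact masks: rewards matching
    *when* the hands touch, which joint-position error averages can miss."""
    g, r = contact_mask(gen_dist, thresh), contact_mask(ref_dist, thresh)
    tp = np.sum(g & r)
    fp = np.sum(g & ~r)
    fn = np.sum(~g & r)
    if tp == 0:
        return 0.0
    p, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * rec / (p + rec)

# Toy traces: generated contact onset is shifted two frames from reference
ref = np.array([0.05] * 10 + [0.005] * 10 + [0.05] * 10)
gen = np.array([0.05] * 12 + [0.005] * 10 + [0.05] * 8)
print(round(contact_f1(gen, ref), 3))  # prints 0.8
```

A metric of this shape penalizes the two-frame timing offset directly, while a per-frame position error would barely register it.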