MOCHI: Motion Enhancement of Collaborative Human-object Interactions

2026-06-16 | Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Graphics, Robotics
AI summary

The authors address the difficulty of capturing accurate data when multiple people interact with objects together, which often results in noisy and incomplete recordings. They introduce MOCHI, a two-step method that first improves hand poses to make them realistic and well-matched to body movements, then refines full-body motions by reducing noise using learned motion patterns. Their approach also considers how people interact with each other and objects during optimization. Tests show that MOCHI works well on different datasets and can handle various numbers of participants and interaction types, with uses like creating motion keyframes and augmenting data with different objects.

multi-human object interaction, motion capture, hand-object grasp, motion optimization, diffusion models, motion priors, data augmentation, keyframe animation, physical plausibility, articulation
Authors
Jiye Lee, Yonghun Choi, Jungdam Won
Abstract
Collaborative human-object interaction involves dynamic and complex movements that require mutual anticipation and continuous adjustment between participants and the shared object. Modeling such collaborative multi-human object interaction (MHOI) scenarios requires high-quality data acquisition as a foundational step; however, this is challenging due to the inherent complexity of MHOI, where human-human and human-object interactions occur simultaneously. Such complexity leads to noisy MHOI captures characterized by several artifacts: contact misalignment between hands and objects, motion jitter and temporal inconsistencies in the captured sequences, and missing or incomplete finger-level articulation details. To address these challenges, we present MOCHI (MOtion Enhancement of Collaborative Human-object Interactions), a two-stage framework for enhancing noisy MHOI data. Our approach first generates hand grasps through optimization from noisy body input, producing grasps that are both physically plausible and semantically consistent with the body pose; these optimized grasps are then extended into complete hand-object interaction sequences. Subsequently, the full-body motions of all participants are refined through a diffusion-based noise optimization framework that uses single-person motion priors. During optimization, we introduce objectives that encode human-object and human-human interaction information within these single-person priors. Experimental results demonstrate the effectiveness of our pipeline across diverse MHOI data, either acquired by existing capture methods or synthesized by generative models. We further show the robustness of our system across varying numbers of participants and types of interactions, and demonstrate various applications including keyframe-based MHOI creation and data augmentation through varying object geometries.
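
To make the idea of noise optimization against a frozen prior more concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a hypothetical stand-in for the motion prior (a fixed Gaussian smoothing matrix instead of a diffusion model), a 1D "hand trajectory", and a single hypothetical contact objective. The names `smoothing_prior`, `optimize_noise`, and `contact_error` are illustrative only. What it shows is the general pattern: the prior stays fixed, and only the latent noise is updated so that the generated motion satisfies an interaction objective while staying close to the initial sample.

```python
import numpy as np

def smoothing_prior(num_frames, width=2.0):
    """Hypothetical stand-in for a frozen motion prior: a row-normalized
    Gaussian smoothing matrix mapping latent noise to a smooth 1D trajectory.
    A real system would run a pretrained diffusion sampler here instead."""
    t = np.arange(num_frames)
    kernel = np.exp(-0.5 * ((t[:, None] - t[None, :]) / width) ** 2)
    return kernel / kernel.sum(axis=1, keepdims=True)

def contact_error(prior, z, contact_frames, contact_target):
    """Mean squared distance between the generated trajectory and the
    (assumed) object contact position on the contact frames."""
    motion = prior @ z
    return float(np.mean((motion[contact_frames] - contact_target) ** 2))

def optimize_noise(prior, z_init, contact_frames, contact_target,
                   reg_weight=0.01, lr=0.2, steps=500):
    """Gradient descent on the latent noise z; the prior is never updated.
    Loss = sum of squared contact residuals on the contact frames
           + reg_weight * ||z - z_init||^2 (stay near the initial sample)."""
    z = z_init.copy()
    mask = np.zeros(len(z_init))
    mask[contact_frames] = 1.0
    for _ in range(steps):
        motion = prior @ z
        residual = mask * (motion - contact_target)  # zero outside contact frames
        grad = 2.0 * prior.T @ residual + 2.0 * reg_weight * (z - z_init)
        z -= lr * grad
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prior = smoothing_prior(60)
    z0 = rng.standard_normal(60)
    frames = np.arange(20, 40)  # frames where the hand should touch the object
    z_opt = optimize_noise(prior, z0, frames, contact_target=1.0)
    print("contact error before:", contact_error(prior, z0, frames, 1.0))
    print("contact error after: ", contact_error(prior, z_opt, frames, 1.0))
```

In this toy setting the gradient is written out analytically because the stand-in prior is linear; with an actual diffusion model the same loop would typically rely on automatic differentiation through the sampler, and the single contact term would be replaced by whatever human-object and human-human objectives the method defines.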