MotionHalluc: Diagnosing Kinematic Hallucinations in Fine-Grained Motion Reasoning
2026-06-22 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary
The authors studied how computers give feedback on differences between two videos showing motions. They found that current models often make mistakes by imagining motions that aren't really there, which they call 'motion hallucinations.' To better understand this, they created a new test called MotionHalluc with lots of questions about different types of hallucinations. They also introduced a simple method called PPV that helps check if the feedback is accurate by using actual measurements from the videos, which improved the models’ performance. Their work shows that using concrete motion data helps reduce errors when comparing motions across videos.
motion instruction generation, cross-video comparison, motion hallucination, kinematic differences, MotionHalluc benchmark, directional hallucination, attributional hallucination, temporal hallucination, Perceive-Parse-Verify (PPV), multimodal models
Authors
Weile Guo, Shenghong He, Danying Mo, Chengdong Xu, Xuexun Liu, Chao Yu
Abstract
Motion instruction generation in cross-video comparison aims to produce corrective feedback that describes the differences between a query motion and a reference motion. However, existing models often generate instructions that exhibit motion hallucinations: the instructions fail to reflect the actual kinematic differences between the paired videos. To systematically investigate these hallucinations, we introduce MotionHalluc, a dedicated benchmark for evaluating motion hallucinations in paired-video comparison. MotionHalluc comprises 1540 fine-grained questions over 553 video pairs and evaluates hallucinations along three core dimensions: (1) directional hallucination, (2) attributional hallucination, and (3) temporal hallucination. Extensive evaluations show that state-of-the-art large multimodal models are highly susceptible to these hallucinations. Furthermore, we provide Perceive-Parse-Verify (PPV), a training-free measurement extraction and verification baseline that converts candidate instructions into executable measurement queries and supplies kinematic measurements at inference time. Our results show that this simple measurement injection yields an average 10.6% performance gain across models, suggesting that motion reasoning with explicit quantitative measurements is a key factor in reducing hallucinations in cross-video comparison. Our code and dataset will be made publicly available upon acceptance.
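To make the idea of turning a candidate instruction into an executable measurement query more concrete, here is a minimal Python sketch. It is not the authors' PPV implementation, which the abstract does not detail. Everything in it is an assumption made for illustration: the single-frame 2D pose format, the joint names, the keyword-based `parse_instruction`, the `MeasurementQuery` type, and the 5-degree tolerance. It only checks whether the direction of a corrective instruction agrees with a measured joint-angle difference between the query and the reference.

```python
import math
from dataclasses import dataclass

# Hypothetical pose format: joint name -> (x, y) for a single frame.
Pose = dict[str, tuple[float, float]]


@dataclass
class MeasurementQuery:
    joints: tuple[str, str, str]  # the angle is measured at the middle joint
    change: str                   # "decrease" or "increase" the query's angle


def joint_angle(pose: Pose, a: str, b: str, c: str) -> float:
    """Angle in degrees at joint b between segments b->a and b->c."""
    ax, ay = pose[a]
    bx, by = pose[b]
    cx, cy = pose[c]
    v1 = (ax - bx, ay - by)
    v2 = (cx - bx, cy - by)
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0.0:
        raise ValueError("degenerate pose: coincident joints")
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / norm
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def parse_instruction(text: str) -> MeasurementQuery:
    """Toy keyword parser (assumption): maps an instruction to an angle query."""
    t = text.lower()
    side = "left" if "left" in t else "right"
    if "knee" in t:
        joints = (f"{side}_hip", f"{side}_knee", f"{side}_ankle")
    elif "elbow" in t:
        joints = (f"{side}_shoulder", f"{side}_elbow", f"{side}_wrist")
    else:
        raise ValueError("unsupported joint in instruction")
    if "bend" in t:
        change = "decrease"   # bending more makes the joint angle smaller
    elif "straighten" in t:
        change = "increase"   # straightening makes the joint angle larger
    else:
        raise ValueError("unsupported action in instruction")
    return MeasurementQuery(joints, change)


def verify(query: MeasurementQuery, query_pose: Pose, ref_pose: Pose,
           tol_deg: float = 5.0) -> tuple[bool, float, float]:
    """Check the instruction's direction against the measured angles."""
    q = joint_angle(query_pose, *query.joints)
    r = joint_angle(ref_pose, *query.joints)
    needed = r - q  # change the query needs in order to match the reference
    if query.change == "decrease":
        supported = needed < -tol_deg
    else:
        supported = needed > tol_deg
    return supported, q, r


# Query: straight left leg. Reference: left knee bent to a right angle.
query_pose = {"left_hip": (0.0, 0.0), "left_knee": (0.0, 1.0), "left_ankle": (0.0, 2.0)}
ref_pose = {"left_hip": (0.0, 0.0), "left_knee": (0.0, 1.0), "left_ankle": (1.0, 1.0)}

ok, q_angle, r_angle = verify(parse_instruction("Bend your left knee more"),
                              query_pose, ref_pose)
bad, _, _ = verify(parse_instruction("Straighten your left knee"),
                   query_pose, ref_pose)
```

In this toy case the first instruction agrees with the measurements, while the second points in the wrong direction, which loosely corresponds to what the abstract calls a directional hallucination. A fuller system would need pose estimation over whole clips and time-aware queries to cover the attributional and temporal dimensions, which this sketch does not attempt.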