ReConFuse: Reconstruction-Error Guided Semantic Fusion for AI-Generated Video Detection
2026-06-03 • Computer Vision and Pattern Recognition
AI summary
The authors study how to tell real videos from AI-made fake ones by looking at the mistakes a model makes when trying to recreate the video. They find that real and fake videos show different patterns of errors over time. To use this, they build a system called ReConFuse that combines these error patterns with video content information to spot AI fakes better. Tests show their method works well even with new types of generated videos.
AI-generated videos, video forensics, reconstruction error, WF-VAE, temporal dynamics, semantic features, video classification, generalization, Mamba module, multimedia authentication
Authors
Xiaojing Chen, Xinyu Lu, Changtao Miao, Yunfeng Diao
Abstract
AI-generated videos are becoming increasingly realistic, raising serious concerns about misinformation, content authenticity, and media trust. Reliable AI-generated video detection is therefore essential for multimedia forensics, yet it remains challenging because a detector must capture spatial artifacts and temporal dynamics while generalizing to evolving generative models. In this paper, we explore reconstruction error as a discriminative forensic cue for AI-generated video detection. By reconstructing input videos with a pretrained WF-VAE, we observe that real and generated videos exhibit distinguishable frame-wise reconstruction error patterns, suggesting that reconstruction errors can reveal their distributional discrepancies. However, extending reconstruction-based image detection to videos is non-trivial, since video reconstruction errors are temporally organized across frames and require semantic context for effective interpretation. To address these challenges, we propose ReConFuse, a reconstruction-guided semantic fusion framework for video-level AI-generated video detection. ReConFuse extracts reconstruction error cues from WF-VAE-reconstructed videos, aligns them with multi-frame semantic features, and uses a Mamba-based module to model their temporal evolution for video-level classification. Experiments across multiple generators and evaluation settings demonstrate the effectiveness and strong generalization ability of ReConFuse.
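To make the frame-wise reconstruction error cue concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the function names are hypothetical, the choice of mean squared error is an assumption (the abstract does not state the error metric), and `toy_reconstruct` is only a stand-in for a pretrained video autoencoder such as WF-VAE, whose API is not shown here.

```python
import numpy as np


def framewise_reconstruction_error(video, reconstruct):
    """Return one mean squared reconstruction error per frame.

    video: array of shape (T, H, W, C) with float pixel values.
    reconstruct: hypothetical callable standing in for a pretrained
        video autoencoder; it must return an array of the same shape.
    """
    video = np.asarray(video, dtype=np.float64)
    recon = np.asarray(reconstruct(video), dtype=np.float64)
    if recon.shape != video.shape:
        raise ValueError("reconstruction must have the same shape as the input")
    # Average the squared error over height, width, and channels,
    # keeping the time axis so the result is a length-T error sequence.
    return ((video - recon) ** 2).mean(axis=(1, 2, 3))


def toy_reconstruct(video):
    """Placeholder 'autoencoder' (assumption): coarse quantization of pixels."""
    return np.round(video * 4.0) / 4.0


# Example: a random 5-frame clip with 8x8 RGB frames in [0, 1).
clip = np.random.default_rng(0).random((5, 8, 8, 3))
errors = framewise_reconstruction_error(clip, toy_reconstruct)
print(errors.shape)  # one error value per frame
```

The resulting length-T sequence is the kind of temporally ordered signal that the abstract says is aligned with semantic features and passed to a temporal model. That fusion and the Mamba-based module are not sketched here.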