OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

2026-04-09 • Computer Vision and Pattern Recognition


AI summary

The authors propose OmniJigsaw, a new method to improve how AI understands and reasons with videos and sounds together by training it to put mixed-up clips back in the right order. They use three strategies to mix visual and audio information in a way that forces the AI to learn from both together. They also create a special filtering process to prepare large amounts of unlabeled data for training. Their tests show that one strategy, clip-level modality masking, works better than others by avoiding shortcuts the AI might use. Overall, their approach improves the AI's ability in video, audio, and combined reasoning tasks.

Reinforcement Learning, Self-Supervised Learning, Omni-modal Models, Temporal Reordering, Audio-Visual Integration, Proxy Task, Data Filtering, Cross-Modal Learning, Modality Masking, Collaborative Reasoning

Authors

Yiduo Jia, Muzhi Zhu, Hao Zhong, Mingyu Liu, Yuling Xi, Hao Chen, Bin Qin, Yongjie Yang, Zhenbo Luo, Chunhua Shen

Abstract

To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
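The reordering proxy task and its three modality strategies can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not specify how modalities are selected or how the reward is computed, so the names `Clip`, `make_puzzle`, and `reorder_reward`, the 50/50 random modality choice, and the position-wise accuracy reward are all assumptions made for illustration. Clips are represented by placeholder strings rather than real video or audio.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Clip:
    index: int            # original chronological position of the clip
    video: Optional[str]  # placeholder for visual content; None when masked
    audio: Optional[str]  # placeholder for audio content; None when masked


def make_puzzle(num_clips: int, strategy: str,
                rng: random.Random) -> Tuple[List[Clip], List[int]]:
    """Build a shuffled-clip puzzle; returns (shuffled clips, target order).

    The random 50/50 modality choice is an assumption, not the paper's rule.
    """
    clips = [Clip(i, f"v{i}", f"a{i}") for i in range(num_clips)]
    if strategy == "joint":
        pass  # every clip keeps both video and audio
    elif strategy == "sample":
        # one modality is chosen once for the whole sample
        keep_video = rng.random() < 0.5
        for clip in clips:
            if keep_video:
                clip.audio = None
            else:
                clip.video = None
    elif strategy == "clip":
        # each clip independently keeps exactly one modality
        for clip in clips:
            if rng.random() < 0.5:
                clip.audio = None
            else:
                clip.video = None
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    order = list(range(num_clips))
    rng.shuffle(order)
    shuffled = [clips[i] for i in order]
    # target[k] is the original position of the clip shown at slot k
    return shuffled, order


def reorder_reward(predicted: List[int], target: List[int]) -> float:
    """Fraction of slots whose original position is predicted correctly
    (a hypothetical reward; the paper's reward may differ)."""
    if len(predicted) != len(target):
        raise ValueError("prediction and target must have equal length")
    return sum(p == t for p, t in zip(predicted, target)) / len(target)
```

Under these assumptions, a "clip" puzzle can only be solved by using video for some clips and audio for others, whereas a "joint" puzzle could in principle be solved from either modality alone, which is one way to picture the shortcut the abstract describes.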
