MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding

2026-06-08 · Computer Vision and Pattern Recognition

AI summary

The authors propose MAVIS, a text-to-video retrieval method that avoids scanning the entire video corpus for every query. MAVIS first parses each video into a structured library of attributes, then uses a planner to split a query into smaller sub-tasks that are handled by specialized agents. The agents nominate candidates independently and then debate, vetoing logical mismatches, so that only a small set of disputed candidates needs fine-grained verification. This makes retrieval faster and easier to interpret. The method achieves competitive results on MSR-VTT, MSVD, and ActivityNet without task-specific fine-tuning.

video retrieval, multi-agent system, semantic indexing, structured semantic library, query decomposition, logic-aware debate, veto protocol, dual-encoder, MSR-VTT, ActivityNet
Authors
Jie Zhang, Qilang Ye, Hao Zhou, Haochen Liang, Fei Luo
Abstract
The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and the semantic asymmetry between information-dense videos and sparse textual queries. To bridge this gap, we introduce MAVIS, a novel multi-agent framework that rethinks retrieval as cooperative reasoning rather than brute-force search. MAVIS first bridges the granularity mismatch by parsing raw videos into a Structured Semantic Library, enabling explicit attribute-level indexing. During retrieval, a planner decomposes complex user intents into atomic sub-tasks, dispatching specialized agents to independently nominate candidates. Crucially, MAVIS employs a Logic-aware Debate mechanism with a strict veto protocol, where agents collaboratively prune logical mismatches to identify a compact set of "controversial" candidates for fine-grained verification. This agentic workflow effectively bypasses the inefficiency of full-library traversal. Extensive experiments on MSR-VTT, MSVD, and ActivityNet demonstrate that MAVIS achieves competitive performance without task-specific fine-tuning, offering a scalable and interpretable alternative to traditional dual-encoder approaches.
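The retrieval flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the library contents, the attribute names (`object`, `action`, `scene`), the function names, and the veto rule are all assumptions made for illustration, and the planner is replaced by a hand-written list of sub-tasks rather than a model that decomposes a free-text query. The sketch only shows the general shape of attribute-level indexing, independent nomination, veto-based pruning, and the leftover set of disputed candidates.

```python
from collections import defaultdict

# Hypothetical "structured semantic library": video id -> attribute -> values.
# A missing attribute means the video was not annotated for it.
LIBRARY = {
    "v1": {"object": {"dog"}, "action": {"run"}, "scene": {"beach"}},
    "v2": {"object": {"dog"}, "action": {"sleep"}, "scene": {"sofa"}},
    "v3": {"object": {"cat"}, "action": {"run"}, "scene": {"beach"}},
    "v4": {"object": {"dog"}, "action": {"run"}},  # scene not annotated
}


def build_index(library):
    """Build an inverted index: (attribute, value) -> set of video ids."""
    index = defaultdict(set)
    for vid, attrs in library.items():
        for attr, values in attrs.items():
            for value in values:
                index[(attr, value)].add(vid)
    return index


def nominate(index, subtask):
    """One agent nominates every video that matches its own sub-task."""
    return set(index.get(subtask, set()))


def vetoes(library, vid, subtask):
    """Assumed veto rule: an agent vetoes a video only when the video is
    annotated for the agent's attribute and the annotation contradicts it."""
    attr, value = subtask
    values = library[vid].get(attr)
    return values is not None and value not in values


def retrieve(library, subtasks):
    """Return (confirmed, controversial) candidate lists for the sub-tasks."""
    index = build_index(library)  # in practice this would be built offline

    # Each agent nominates independently; only nominated videos are examined.
    nominated = set()
    for subtask in subtasks:
        nominated |= nominate(index, subtask)

    # Strict veto: a single contradicting agent removes the candidate.
    survivors = {
        vid for vid in nominated
        if not any(vetoes(library, vid, s) for s in subtasks)
    }

    # Fully supported candidates are confirmed; the rest stay disputed and
    # would be passed on to a finer-grained verification step.
    confirmed = sorted(
        vid for vid in survivors
        if all(vid in index.get(s, set()) for s in subtasks)
    )
    controversial = sorted(survivors - set(confirmed))
    return confirmed, controversial


# Stand-in for the planner output for a query like "a dog running on a beach".
SUBTASKS = [("object", "dog"), ("action", "run"), ("scene", "beach")]
confirmed, controversial = retrieve(LIBRARY, SUBTASKS)
print(confirmed, controversial)
```

In this toy run, `v2` and `v3` are vetoed because one of their annotations contradicts a sub-task, `v1` is confirmed by every agent, and `v4` remains disputed because its scene is unknown. Only `v4` would need the more expensive verification step, which is the intuition behind avoiding a full-library traversal.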