EVIDENT: Routing MLLM Adaptation through Entity-Grounded Visual Evidence for Cross-Domain Video Temporal Grounding

2026-05-25
Computer Vision and Pattern Recognition

AI summary

The authors find that fine-tuning video temporal grounding models often helps on familiar data but fails on new types of videos, because under visual domain shift the models struggle to connect what they have learned about locating moments with the specific objects or entities they see. To fix this, they propose EVIDENT, a method that makes the model rely on explicit visual entities when matching descriptions to video moments. The approach compresses visual information into entity slots, teaches each slot to bind to a coherent object, and uses these entities as evidence for finding query-relevant moments. Experiments show that the method improves robustness across video domains while keeping in-domain performance competitive.

Multimodal Large Language Models, Video Temporal Grounding, Domain Shift, Entity Attention, Entity Bottleneck Adapter, Distillation Loss, Visual Entities, Temporal Localization, Cross-domain Robustness, Parameter-efficient Adaptation
Authors
Geo Ahn, Jiwook Han, Youngrae Kim, Joonseok Lee, Jinwoo Choi
Abstract
Fine-tuning MLLMs for Video Temporal Grounding (VTG) often improves in-domain performance but degrades sharply under domain shift. In this work, we find that this failure is driven not only by unseen query concepts but primarily by visual domain shift, which prevents the model from coupling its learned temporal localization knowledge with its inherent entity-attention capability. To address this, we introduce EVIDENT, a parameter-efficient adaptation framework that anchors temporal grounding in the inherent entity-attention of pre-trained MLLMs by routing VTG adaptation through explicit visual entity evidence. EVIDENT consists of three components: (i) an Entity Bottleneck Adapter that transforms dense visual tokens into compact entity-level slots, (ii) an Entity-Binding Distillation loss that instills objectness priors into the semantically unstructured MLLM visual space, guiding each slot to bind to a coherent entity, and (iii) an Entity-to-eVidence gating mechanism that leverages the captured entities as evidence, steering the model to localize moments containing query-relevant entities. Together, these components enable VTG fine-tuning to rely on entity-grounded evidence rather than brittle dataset shortcuts. Experiments on cross-domain VTG benchmarks show that EVIDENT consistently improves out-of-domain robustness while preserving competitive in-domain performance with modest parameter overhead. These results suggest that entity-level grounding is an effective inductive bias for generalizable temporal localization.
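
As an illustration of the general idea behind components (i) and (iii), the sketch below pools a set of dense visual tokens into a few slots with cross-attention and then scales each slot by a query-dependent gate. It is a minimal, standard-library sketch under stated assumptions, not the authors' implementation: the abstract does not specify the adapter or gate architecture, so the cross-attention pooling, the sigmoid gate, and every name here (`pool_tokens_into_slots`, `gate_slots`, `slot_queries`, `query_embedding`) are hypothetical, and random vectors stand in for learned parameters and real features.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_tokens_into_slots(tokens, slot_queries):
    """Assumed bottleneck: each slot is an attention-weighted mean of tokens."""
    dim = len(slot_queries[0])
    scale = 1.0 / math.sqrt(dim)
    slots = []
    for query in slot_queries:
        weights = softmax([dot(query, tok) * scale for tok in tokens])
        slot = [sum(w * tok[d] for w, tok in zip(weights, tokens))
                for d in range(dim)]
        slots.append(slot)
    return slots


def gate_slots(slots, query_embedding):
    """Assumed gate: sigmoid of slot/query similarity scales each slot."""
    gates = [1.0 / (1.0 + math.exp(-dot(slot, query_embedding)))
             for slot in slots]
    gated = [[g * x for x in slot] for g, slot in zip(gates, slots)]
    return gates, gated


# Toy inputs: random vectors stand in for visual tokens and text features.
random.seed(0)
dim, num_tokens, num_slots = 8, 32, 4
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
slot_queries = [[random.gauss(0, 1) for _ in range(dim)]
                for _ in range(num_slots)]
query_embedding = [random.gauss(0, 1) for _ in range(dim)]

slots = pool_tokens_into_slots(tokens, slot_queries)
gates, gated = gate_slots(slots, query_embedding)
print(len(tokens), "tokens ->", len(slots), "slots of dim", len(slots[0]))
```

In this toy setup, 32 token vectors are reduced to 4 slot vectors, and slots more similar to the query embedding receive gates closer to 1. The Entity-Binding Distillation loss in component (ii) is not sketched, because the abstract gives no detail on its form.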