3rd Place at CVPR 2026 CASTLE Challenge: Agentic Multi-View Long-Context Video Understanding via Hierarchical Knowledge Graph Retrieval

2026-06-01
Computer Vision and Pattern Recognition
AI summary

The authors developed a method to answer really hard questions about long and complex videos taken from many cameras at once. Their approach uses a Video Knowledge Graph to understand how people and events connect over time, and a smart system that breaks down questions step-by-step to find answers efficiently. They showed their method works well without extra training by testing it on huge multi-camera video data. Their work earned them third place in a global challenge.

Video Knowledge Graph, Multi-view tracking, Temporal reasoning, Multi-hop relational reasoning, Agentic framework, Zero-shot reasoning, Long-form video understanding, Spatiotemporal questions, Hierarchical retrieval, Ego and exo cameras
Authors
Raghad Albusayes, Munirah Alyahya
Abstract
This paper presents our winning methodology for the CASTLE 2026 Challenge at the CVPR 2026 EgoVis Workshop, where our team secured third place globally. The challenge tasks participants with answering highly complex visual, spatiotemporal, and verbal questions, covering visual counting, action localization, multi-view tracking, and speaker temporal reasoning, within massive multimodal video streams. The underlying dataset consists of over 600 hours of synchronized footage captured by 15 ego and exo camera sources. To tackle the extreme scale and long-context demands of this setting, we introduce a training-free agentic framework optimized for long-form video understanding. The framework has two core architectural components: i) a Video Knowledge Graph that maps static and dynamic entities, their temporal relationships, and intersecting events to enable multi-hop relational reasoning, and ii) an adaptive agentic workflow that resolves complex queries through hierarchical retrieval and indexing. Empirical results demonstrate that our framework achieves high zero-shot reasoning accuracy on long-context multi-view streams. Our code will be released at https://github.com/RaghadKhaled/CASTLE-Challenge-Framework.
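To make the idea of multi-hop relational reasoning over a Video Knowledge Graph more concrete, the following is a minimal illustrative sketch and not the authors' implementation. The abstract does not specify a schema, so every name here (`Edge`, `VideoKnowledgeGraph`, `multi_hop`, the relation labels, the camera identifiers, and the timestamps) is a hypothetical assumption chosen for illustration. It stores time-stamped relations between entities and follows a chain of relations, optionally restricted to a time window.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One time-stamped relation observed by one camera (hypothetical schema)."""
    subject: str
    relation: str
    obj: str
    start: float  # seconds from the start of the recording
    end: float
    camera: str


class VideoKnowledgeGraph:
    """Toy graph: entities are nodes, time-stamped relations are directed edges."""

    def __init__(self):
        self._out = defaultdict(list)  # subject -> outgoing edges

    def add(self, subject, relation, obj, start, end, camera):
        edge = Edge(subject, relation, obj, start, end, camera)
        self._out[subject].append(edge)
        return edge

    def neighbors(self, node, relation=None, window=None):
        """Yield edges leaving `node`, filtered by relation and time overlap."""
        for e in self._out.get(node, ()):
            if relation is not None and e.relation != relation:
                continue
            # Skip edges whose interval does not overlap the query window.
            if window is not None and (e.end < window[0] or e.start > window[1]):
                continue
            yield e

    def multi_hop(self, start, relations, window=None):
        """Follow a chain of relations from `start`; return the entities reached."""
        frontier = {start}
        for rel in relations:
            frontier = {
                e.obj
                for node in frontier
                for e in self.neighbors(node, rel, window)
            }
        return frontier


# Made-up observations, for illustration only.
kg = VideoKnowledgeGraph()
kg.add("person_A", "holds", "mug", 10.0, 25.0, "ego_1")
kg.add("mug", "placed_on", "kitchen_table", 25.0, 300.0, "exo_2")
kg.add("person_B", "holds", "laptop", 12.0, 40.0, "ego_3")

# Two-hop question: where did the object that person_A held end up?
reached = kg.multi_hop("person_A", ["holds", "placed_on"])
# Same question, restricted to the first 20 seconds of footage.
reached_early = kg.multi_hop("person_A", ["holds", "placed_on"], window=(0.0, 20.0))
```

In this toy example the unrestricted query reaches `kitchen_table`, while the windowed query returns an empty set because the `placed_on` relation starts after the window closes. A real system of the kind the abstract describes would presumably also need entity resolution across camera views and an index over the graph, neither of which is modeled here.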