LensWalk: Agentic Video Understanding by Planning How You See in Videos

2026-03-25 · Computer Vision and Pattern Recognition · Artificial Intelligence
AI summary

The authors identify a problem in video understanding: current models look at pre-processed snapshots and cannot explore videos dynamically as they think. They propose LensWalk, a system that lets a language model decide when and how to watch parts of a video step by step to gather evidence. This approach lets the model focus its attention, verify facts, and combine information from different moments in the video. The method improves accuracy on challenging long-video tasks without any extra training. The key insight is that letting the model control how it views video data leads to more accurate and more interpretable reasoning.

Vision-Language Models · Large Language Models · Video Understanding · Temporal Sampling · Agentic Framework · Chain of Thought · Video Reasoning · Evidence Gathering · Self-directed Perception · Long-video Benchmarks
Authors
Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan
Abstract
The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by an inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from the video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to actively control its own visual observation. LensWalk establishes a tight reason-plan-observe loop in which the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch together evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent's evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains across multiple model recipes, boosting accuracy by over 5% on challenging long-video benchmarks such as LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.
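To make the reason-plan-observe loop described in the abstract concrete, here is a minimal sketch of how an agent might parameterize its observations by temporal scope and sampling density. All names (`Observation`, `observe_segment`, `lenswalk_loop`) and the specific parameter values are hypothetical illustrations, not the paper's actual tool interface.

```python
# Hedged sketch of a reason-plan-observe loop in the spirit of LensWalk.
# The tool interface and parameters below are assumptions for illustration;
# the paper's actual implementation is not reproduced here.
from dataclasses import dataclass


@dataclass
class Observation:
    start_s: float  # temporal scope: segment start (seconds)
    end_s: float    # temporal scope: segment end (seconds)
    fps: float      # sampling density (frames per second)
    caption: str    # evidence a VLM tool would extract from the sampled frames


def observe_segment(video_id: str, start_s: float, end_s: float, fps: float) -> Observation:
    """Stand-in for a VLM-based observation tool (hypothetical): in a real
    system this would sample frames from [start_s, end_s] at `fps` and return
    a textual description of what was seen."""
    return Observation(
        start_s, end_s, fps,
        caption=f"description of {video_id}[{start_s:.0f}-{end_s:.0f}s] at {fps} fps",
    )


def lenswalk_loop(video_id: str, question: str) -> list:
    """Illustrative loop: first a broad, sparse scan for cues, then a narrow,
    dense look for fact extraction. In the actual framework an LLM reasoner
    would choose these scopes and densities from its evolving chain of thought."""
    evidence = []
    # Step 1: broad scan for cues (wide scope, sparse sampling).
    evidence.append(observe_segment(video_id, 0.0, 600.0, fps=0.2))
    # Step 2: focused fact extraction (narrow scope, dense sampling),
    # as if the reasoner decided the answer lies around the 2-minute mark.
    evidence.append(observe_segment(video_id, 120.0, 150.0, fps=2.0))
    return evidence


evidence = lenswalk_loop("demo_video", "What does the speaker show after the intro?")
print(len(evidence))
```

The key design point the sketch tries to capture is that the observation parameters are outputs of the reasoning step, not a fixed preprocessing choice: the second call uses a much narrower scope and a tenfold denser sampling rate than the first.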