Adaptive Greedy Frame Selection for Long Video Understanding
2026-03-20 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition • Artificial Intelligence • Computation and Language
AI summary
The authors address long-video question answering, where the number of input frames and the resulting visual tokens bottleneck inference. They propose a method that selects key frames under a fixed frame budget by balancing how relevant each frame is to the question against how well the selected frames represent the rest of the video. Their approach embeds candidate frames in two spaces (SigLIP for question relevance, DINOv2 for semantic similarity) and greedily maximizes a weighted sum of a relevance term and a facility-location coverage term; because this objective is monotone and submodular, the greedy selection carries the standard (1 - 1/e) approximation guarantee. They also define four preset strategies and a text-only question-type classifier that routes each question to a preset. Experiments on MLVU show accuracy gains over uniform sampling and a recent baseline, with the largest improvements when only a few frames can be used.
vision-language models • long-video question answering • frame selection • semantic representativeness • query relevance • embedding • submodular optimization • greedy algorithm • frame sampling • question classification
Authors
Yuning Huang, Fengqing Zhu
Abstract
Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1 - 1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
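To make the selection step concrete, the following is a minimal sketch of greedy maximization of a weighted relevance plus facility-location objective. It is not the authors' implementation. The function names `cosine_similarity_matrix` and `greedy_select`, the single trade-off weight `lam`, the averaging of the coverage term over candidates, the clipping of similarities to [0, 1], and the return of indices in temporal order are all assumptions made for illustration; the abstract does not specify the exact weights, normalization, or preset values. Here `relevance` stands in for per-frame question relevance scores (for example from SigLIP) and `sim` for pairwise frame similarity (for example from DINOv2 embeddings).

```python
import numpy as np


def cosine_similarity_matrix(emb):
    """Pairwise cosine similarity, clipped to [0, 1].

    Clipping keeps similarities non-negative, which the facility-location
    term needs in order to stay monotone (an assumption of this sketch).
    """
    emb = np.asarray(emb, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return np.clip(emb @ emb.T, 0.0, 1.0)


def greedy_select(relevance, sim, budget, lam=0.5):
    """Greedily pick up to `budget` frame indices that maximize

        lam * sum_{i in S} relevance[i]
        + (1 - lam) * mean_j max_{i in S} sim[j, i]

    `lam` is a hypothetical trade-off weight: 1.0 is pure relevance,
    0.0 is pure coverage.
    """
    relevance = np.asarray(relevance, dtype=float)
    sim = np.asarray(sim, dtype=float)
    n = relevance.shape[0]
    budget = min(budget, n)

    # best_cover[j] = similarity of candidate j to its closest selected frame
    best_cover = np.zeros(n)
    available = np.ones(n, dtype=bool)
    selected = []

    for _ in range(budget):
        # Marginal coverage gain of adding frame i: how much it raises
        # each candidate's best similarity, averaged over all candidates.
        cover_gain = np.maximum(sim - best_cover[:, None], 0.0).mean(axis=0)
        gain = lam * relevance + (1.0 - lam) * cover_gain
        gain[~available] = -np.inf  # never pick the same frame twice

        i = int(np.argmax(gain))
        selected.append(i)
        available[i] = False
        best_cover = np.maximum(best_cover, sim[:, i])

    # Return indices in temporal order (assumed convention for feeding a VLM).
    return sorted(selected)


# Toy example: frames 0-2 are near-duplicates of one scene, frames 3-5 of another.
toy_emb = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
toy_rel = np.array([0.9, 0.8, 0.7, 0.1, 0.1, 0.2])
toy_sim = cosine_similarity_matrix(toy_emb)

print(greedy_select(toy_rel, toy_sim, budget=2, lam=1.0))  # relevance only
print(greedy_select(toy_rel, toy_sim, budget=2, lam=0.3))  # coverage-weighted
```

In the toy example, pure relevance selects two frames from the same scene, while the coverage-weighted setting selects one frame per scene. This mirrors the collapse onto near-duplicate frames that the abstract attributes to purely relevance-driven selection. A question-adaptive variant would choose `lam` (or a richer preset) per question type, which this sketch leaves out.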