LightSTAR: Efficient Visual Document Retrieval via Lightweight Selection with Vision-Adaptive Refinement
2026-06-22 • Computer Vision and Pattern Recognition
AI summary
The authors address the problem of finding relevant pages in large collections of visual documents based on user queries. They observe that queries often contain important keywords that appear directly in the visible text of relevant pages, which can help narrow down candidate pages quickly without heavy computation. Their proposed method, LightSTAR, first uses a fast, LLM-free step to select a high-recall set of candidate pages, then applies fine-grained semantic matching only to these candidates to improve accuracy. The authors report state-of-the-art retrieval accuracy with end-to-end latency reduced severalfold relative to previous methods.
Visual Document Retrieval, Multi-modal Large Language Models (MLLMs), Query Encoding, Semantic Matching, Content-grounded Encoding, Contrastive Learning, Feature Fusion, Layout Analysis, Candidate Selection, End-to-end Latency
Authors
Tongkun Guan, Haocheng Wang, Wei Shen, Xiaokang Yang
Abstract
Visual document retrieval requires rapidly locating relevant pages in large multi-modal corpora in response to user queries. While recent methods powered by Multi-modal Large Language Models (MLLMs) achieve competitive accuracy, they incur prohibitive computational costs because they apply intensive MLLM encoding to every single page. Meanwhile, we observe that user queries are typically keyword-anchored: they contain semantically rich words that are expected to appear directly in the visible text of relevant pages, offering an efficient cue for quickly narrowing down candidate pages. Building on this insight, we propose LightSTAR, an efficient framework that decomposes visual document retrieval into two stages: 1) LLM-free Visual Selection, which uses content-grounded query encoding to focus on informative words and employs LLM-free visual embeddings to produce a high-recall candidate set; and 2) Vision-adaptive Semantic Refinement, which performs fine-grained semantic matching exclusively on these top candidates via adaptive region-wise feature fusion that combines textual and layout cues, optimized through a hardness-aware contrastive objective. Experimental results demonstrate that LightSTAR achieves state-of-the-art retrieval accuracy while reducing end-to-end latency severalfold, offering a highly practical solution to the accuracy-efficiency trade-off in visual document retrieval. Code is available at https://github.com/bokufa/LightSTAR.
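The select-then-refine decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`dot`, `two_stage_retrieve`, `expensive_score`), the toy two-dimensional embeddings, and the hand-picked refinement scores are all hypothetical, and a plain dot product stands in for the LLM-free visual selection step. The sketch only shows the control flow that makes such a pipeline cheap: every page gets a lightweight score, and the costly scorer runs on the top-k candidates alone.

```python
from typing import Callable, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Cheap similarity used as a stand-in for the lightweight selection stage."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_retrieve(
    query_vec: Sequence[float],
    page_vecs: Sequence[Sequence[float]],
    rerank_score: Callable[[int], float],
    k: int,
) -> List[int]:
    """Return page indices ranked by the expensive scorer, restricted to top-k.

    Stage 1 scores every page with a cheap dot product and keeps k candidates.
    Stage 2 calls the expensive scorer on those k candidates only.
    """
    by_cheap_score = sorted(
        range(len(page_vecs)),
        key=lambda i: dot(query_vec, page_vecs[i]),
        reverse=True,
    )
    candidates = by_cheap_score[:k]
    return sorted(candidates, key=rerank_score, reverse=True)


# Toy data (hypothetical): four pages with 2-d embeddings.
page_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
query_vec = [1.0, 0.0]

# Hypothetical stand-in for a costly refinement model; records which pages it sees.
refine_scores = {0: 0.2, 1: 0.8, 2: 0.9, 3: 0.1}
expensive_calls: List[int] = []


def expensive_score(page_idx: int) -> float:
    expensive_calls.append(page_idx)
    return refine_scores[page_idx]


ranked = two_stage_retrieve(query_vec, page_vecs, expensive_score, k=2)
print(ranked)           # final ranking among the two selected candidates
print(expensive_calls)  # pages the expensive scorer was actually run on
```

In this toy run, page 2 has the highest refinement score but is never seen by the expensive scorer because the first stage filtered it out. That is the reason the selection stage has to be high-recall: the refinement stage can reorder candidates but cannot recover a relevant page that was dropped earlier.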