Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning

2026-06-01 • Computer Vision and Pattern Recognition

AI summary

The authors tackle composed video retrieval: finding a target video that matches a reference video as modified by a text instruction. They first use frozen, pre-trained DINOv3 models to narrow the search to a compact set of visually relevant candidates, with no additional training. Large vision-language models then check whether each candidate satisfies the instruction, and a final reasoning-based refinement step is applied to the top candidates to improve the first-ranked prediction. Without any training, the system reaches 48.78 Recall@1 and 51.48 Recall@5 on the test set. The authors suggest that stronger video-LLMs and closer integration of visual representations with language reasoning could improve accuracy further.

video retrieval, vision-language models, DINOv3, Recall@1, Recall@5, composed video retrieval, reference video, modification instruction, training-free methods, reasoning refinement

Authors

Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Qingming Huang

Abstract

Recent advances in large vision-language models have expanded video retrieval from simple text-based search to more flexible scenarios, where users may specify the desired result through both visual examples and textual instructions. In the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge, the system is required to retrieve a target video according to a reference video and a modification instruction. To address this task, we develop Visual Representation-Guided Video-LLM Reasoning for Training-Free Composed Video Retrieval. Our framework first uses frozen DINOv3 models to obtain a compact set of visually relevant candidates, and then applies large vision-language models to evaluate whether each candidate satisfies the modification instruction. A final reasoning-based refinement is further performed on the top candidates to improve the first-ranked prediction. Without training, our system achieves 48.78 Recall@1 and 51.48 Recall@5 on the test set. Future work may further improve retrieval accuracy through stronger video-LLMs and detailed integration between visual representations and language reasoning.
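
The retrieve-then-rerank structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how DINOv3 features are compared or how the vision-language model produces a score, so cosine similarity over one embedding per video and the `score_fn` callable below are assumptions made purely for illustration. All function names are hypothetical, and the final reasoning-based refinement step is not modeled.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shortlist(ref_emb: Vector, gallery: Dict[str, Vector], k: int) -> List[str]:
    """Stage 1 (assumed form): keep the k gallery videos most similar to the reference.

    The embeddings stand in for features from a frozen visual encoder;
    nothing is trained or fine-tuned here.
    """
    ranked = sorted(gallery, key=lambda vid: cosine(ref_emb, gallery[vid]), reverse=True)
    return ranked[:k]


def rerank(candidates: List[str], instruction: str,
           score_fn: Callable[[str, str], float]) -> List[str]:
    """Stage 2 (assumed form): reorder the shortlist by an instruction-match score.

    score_fn is a hypothetical placeholder for a vision-language model that
    rates how well a candidate video satisfies the modification instruction.
    """
    return sorted(candidates, key=lambda vid: score_fn(vid, instruction), reverse=True)


def recall_at_k(rankings: List[List[str]], targets: List[str], k: int) -> float:
    """Percentage of queries whose target video appears in the top-k results."""
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return 100.0 * hits / len(targets)
```

In this sketch the shortlist bounds how many candidates the more expensive scorer has to examine, which is the usual motivation for a two-stage design. Recall@K is reported on a 0 to 100 scale to match the figures quoted in the abstract.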
