Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning

2026-06-01
Computer Vision and Pattern Recognition

AI summary

The authors address composed video retrieval: finding a target video that matches a reference video as modified by a text instruction. They first use frozen, pre-trained DINOv3 models to narrow the search to a compact set of visually relevant candidates, with no additional training. Large vision-language models then check whether each candidate satisfies the instruction, and a final reasoning step refines the top candidates to improve the first-ranked prediction. Without any training, the method reaches 48.78 Recall@1 and 51.48 Recall@5 on the test set; the authors suggest that stronger video-LLMs and closer integration of visual representations with language reasoning could improve it further.

video retrieval, vision-language models, DINOv3, Recall@1, Recall@5, composed video retrieval, reference video, modification instruction, training-free methods, reasoning refinement
Authors
Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Qingming Huang
Abstract
Recent advances in large vision-language models have expanded video retrieval from simple text-based search to more flexible scenarios, where users may specify the desired result through both visual examples and textual instructions. In the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge, the system is required to retrieve a target video given a reference video and a modification instruction. To address this task, we develop Visual Representation-Guided Video-LLM Reasoning for Training-Free Composed Video Retrieval. Our framework first uses frozen DINOv3 models to obtain a compact set of visually relevant candidates, and then applies large vision-language models to evaluate whether each candidate satisfies the modification instruction. A final reasoning-based refinement is then performed on the top candidates to improve the first-ranked prediction. Without training, our system achieves 48.78 Recall@1 and 51.48 Recall@5 on the test set. Future work may further improve retrieval accuracy through stronger video-LLMs and tighter integration between visual representations and language reasoning.
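
The two-stage flow described in the abstract, together with the Recall@K metric, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that each video has already been reduced to a single embedding vector by a frozen visual encoder, that the shortlist is ranked by cosine similarity, and that the vision-language model is exposed as a scoring function; the names `shortlist`, `rerank`, `recall_at_k`, and `toy_llm_score` are hypothetical, and the toy embeddings and scores are made up for illustration. The final reasoning-based refinement is not modelled, and Recall@K is expressed as a percentage on the assumption that the reported values are percentages.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shortlist(ref_emb: Vector, gallery: Dict[str, Vector], k: int) -> List[str]:
    """Stage 1 (assumed form): keep the k videos most similar to the reference."""
    ranked = sorted(gallery, key=lambda vid: cosine(ref_emb, gallery[vid]), reverse=True)
    return ranked[:k]


def rerank(candidates: List[str], instruction: str,
           score_fn: Callable[[str, str], float]) -> List[str]:
    """Stage 2 (assumed form): order candidates by how well they fit the instruction.

    score_fn is a hypothetical stand-in for a vision-language model call.
    """
    return sorted(candidates, key=lambda vid: score_fn(vid, instruction), reverse=True)


def recall_at_k(rankings: List[List[str]], targets: List[str], k: int) -> float:
    """Percentage of queries whose target appears in the top-k of its ranking."""
    hits = sum(1 for ranking, target in zip(rankings, targets) if target in ranking[:k])
    return 100.0 * hits / len(targets)


# Toy data: made-up 2-D embeddings and made-up instruction-fit scores.
gallery = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
reference_embedding = [1.0, 0.0]
toy_scores = {"a": 0.2, "b": 0.9, "c": 0.5}


def toy_llm_score(video_id: str, instruction: str) -> float:
    # Hypothetical scorer; a real system would query a video-LLM here.
    return toy_scores[video_id]


candidates = shortlist(reference_embedding, gallery, k=2)
final_ranking = rerank(candidates, "same scene, but at night", toy_llm_score)
```

In this toy run, video `c` is discarded by the visual shortlist before the scorer ever sees it, which is the point of the first stage: the expensive language-model check only runs on a small candidate set.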