Reason-Then-Retrieve for CoVR-R with Structured Edit Prompts and Dense-Sparse Fusion

2026-06-01
Computer Vision and Pattern Recognition

AI summary

The authors study CoVR-R, a reason-aware composed video retrieval task: given a reference video and an edit instruction, the system must find the target video that satisfies the edit. The challenge is that the target video is never described directly, so it has to be inferred by reasoning about fine-grained changes in objects, actions, and scenes. They build a zero-shot method that uses a large language model (Qwen3.5-27B) to describe videos and reason about edits, then combines TF-IDF text matching with dense embedding similarity to rank candidates. Their best submission reaches 80.81 R@1 on validation and 89.73 R@1 on the blind test split.

video retrieval, zero-shot learning, language models, embedding, TF-IDF, edit reasoning, structured description, query embedding, rank at K
Authors
DongQing Liu, MengShi Qi, HongWei Ji
Abstract
CoVR-R studies reason-aware composed video retrieval: given a reference video and an edit instruction, the system must retrieve the target video that satisfies the edit. The main difficulty is that the target is not described directly; it must be inferred from fine-grained changes in object identity, action order, final state, hand interaction, and scene transition. We build a zero-shot reason-then-retrieve pipeline around Qwen3.5-27B. For each gallery video, the model generates a retrieval-oriented structured description and a dense embedding by pooling generated-token hidden states with token-dependent weights. For each query, the model first performs edit reasoning over the reference video and instruction, then generates a target-video description whose hidden states serve as the query embedding. We complement dense retrieval with a TF-IDF branch over the generated texts and fuse the two rankings with split-specific weights. On validation, the current best submission reaches 80.81 at R@1, 94.86 at R@5, 97.11 at R@10, and 98.59 at R@50. On the blind test split, it reaches 89.73 at R@1, 95.79 at R@5, 96.63 at R@10, and 97.98 at R@50.
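
The retrieval stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the pooling weights, the fusion rule, or the split-specific weight values, so the sketch assumes a weighted mean for pooling, min-max normalization followed by linear interpolation for fusion, and a hypothetical weight `alpha`. All function names, the toy texts, and the 2-d vectors are invented for illustration, and the hidden states are stand-ins for what the language model would produce.

```python
import math
from collections import Counter

def weighted_pool(hidden_states, weights):
    """Weighted mean of per-token vectors, then L2-normalized (assumed pooling rule)."""
    total = sum(weights)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for h, w in zip(hidden_states, weights)) / total
              for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_idf(docs):
    """Smoothed inverse document frequency over tokenized gallery texts."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf_vector(tokens, idf):
    """L2-normalized sparse TF-IDF vector; terms unseen in the gallery are dropped."""
    vec = {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def sparse_dot(a, b):
    return sum(v * b[t] for t, v in a.items() if t in b)

def min_max(scores):
    lo, hi = min(scores), max(scores)
    return [0.0] * len(scores) if hi == lo else [(s - lo) / (hi - lo) for s in scores]

def fuse(dense, sparse, alpha):
    """Linear interpolation of normalized scores; alpha is a hypothetical weight."""
    return [alpha * d + (1 - alpha) * s
            for d, s in zip(min_max(dense), min_max(sparse))]

def rank(scores):
    """Gallery indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target index appears in the top-k ranking."""
    hits = sum(t in r[:k] for r, t in zip(rankings, targets))
    return hits / len(targets)

# Toy gallery: generated descriptions plus unit-norm stand-in embeddings.
gallery_texts = [
    "person opens the red box then closes it",
    "person opens the blue box and leaves it open",
    "dog runs across the park",
]
gallery_dense = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

# Query side: a generated target description and pooled stand-in hidden states.
query_text = "person opens the blue box and leaves it open"
query_dense = weighted_pool([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])

idf = fit_idf([t.split() for t in gallery_texts])
gallery_sparse = [tfidf_vector(t.split(), idf) for t in gallery_texts]
query_sparse = tfidf_vector(query_text.split(), idf)

dense_scores = [dot(query_dense, g) for g in gallery_dense]
sparse_scores = [sparse_dot(query_sparse, g) for g in gallery_sparse]
fused_scores = fuse(dense_scores, sparse_scores, alpha=0.5)

dense_ranking = rank(dense_scores)
fused_ranking = rank(fused_scores)
```

In this toy setup the dense branch alone ranks gallery item 0 first, while the fused score promotes item 1, whose description matches the query text. That is the kind of correction a sparse branch can contribute when embeddings of near-duplicate videos are close. With a held-out validation split, `alpha` could be selected by sweeping values and keeping the one with the highest `recall_at_k`.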