Towards Open-World Referring Expression Comprehension: A Benchmark with Training-free Multi-task Consistency Checker

2026-05-25 • Computer Vision and Pattern Recognition

AI summary

The authors present OpenRef, a new benchmark for referring expression comprehension (REC), the task of locating objects in an image from a natural-language description. Unlike earlier benchmarks, OpenRef covers harder scenes such as drone views, dark scenes, and adverse weather, allows an expression to match several objects or none, and uses richer vocabulary, including proper nouns, polysemous words, and ordinal terms. The authors use F1 to measure grounding accuracy and propose a new metric, N3R, to measure how reliably models reject negative expressions. They also introduce the Multi-task Consistency Checker (MCC), a training-free method that improves model performance through consistency self-verification. The work is aimed at making REC models better suited to complex, open-world settings.

Referring Expression Comprehension, Vision-Language Models, Benchmark Dataset, Multi-target Recognition, Grounding Accuracy, F1 Score, Polysemous Words, Ordinal Terms, Self-verification, Open-world Scenarios

Authors

Zongjian Wu, Lei Zhang

Abstract

Referring expression comprehension (REC) aims to localize a target object within an image based on a given expression. Although recent advances in vision-language models have led to substantial improvements in REC tasks, current REC benchmarks often feature simple scenarios and assume that each expression maps to a unique object. These limitations hinder the deployment of REC models in open-world environments. To fill this gap, we introduce OpenRef, a new benchmark for REC in complex visual and linguistic scenarios. OpenRef features three key advancements: 1) Diverse visual scenarios: spanning multiple visual domains, including ground views, drone views, dark scenes and adverse weather conditions; 2) Variable target counts: breaking the single-target limitation with multi-target and none-target samples; 3) Rich vocabulary types: incorporating proper nouns, polysemous words and ordinal terms to cover a wider range of expressions. Furthermore, as traditional metrics are insufficient for open-world settings, we use F1 to measure grounding accuracy and propose N3R (Negative Relative Rejection Reliability) to assess relative rejection reliability against negative expressions. Finally, we introduce the Multi-task Consistency Checker (MCC), a training-free, plug-and-play strategy that improves model performance by enforcing consistency through self-verification. Extensive experiments demonstrate that this work significantly advances the performance of existing REC models in complex scenarios, paving the way for open-world REC. Project page: https://zongjianwu.github.io/openref
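
To illustrate why an F1-style score suits variable target counts, here is a minimal sketch of box-level F1 over multi-target and none-target samples. It is not the paper's evaluation code: the abstract does not specify the matching rule, so the IoU threshold of 0.5, the greedy one-to-one matching, the micro-averaging over samples, and all function names (`iou`, `match_counts`, `f1_score`) are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical box format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_counts(preds: Sequence[Box], gts: Sequence[Box], thr: float = 0.5):
    """Greedy one-to-one matching by descending IoU; returns (tp, fp, fn)."""
    pairs = sorted(
        ((iou(p, g), i, j) for i, p in enumerate(preds) for j, g in enumerate(gts)),
        reverse=True,
    )
    used_preds, used_gts = set(), set()
    for score, i, j in pairs:
        if score < thr:
            break  # remaining pairs are all below the threshold
        if i in used_preds or j in used_gts:
            continue
        used_preds.add(i)
        used_gts.add(j)
    tp = len(used_preds)
    return tp, len(preds) - tp, len(gts) - tp


def f1_score(samples: List[Tuple[Sequence[Box], Sequence[Box]]], thr: float = 0.5) -> float:
    """Micro-averaged F1 over (predictions, ground_truths) pairs.

    A none-target sample has an empty ground-truth list, so every
    predicted box on it counts as a false positive (assumed convention).
    """
    tp = fp = fn = 0
    for preds, gts in samples:
        t, f, n = match_counts(preds, gts, thr)
        tp, fp, fn = tp + t, fp + f, fn + n
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # nothing to find, nothing predicted


samples = [
    ([(0, 0, 10, 10)], [(0, 0, 10, 10)]),  # single target, correct box
    ([(0, 0, 10, 10)], []),                # none-target, spurious box
    ([], [(5, 5, 9, 9)]),                  # target present, nothing predicted
]
print(f1_score(samples))
```

Under these assumptions, a model that always outputs exactly one box is penalized on none-target samples (false positives) and on multi-target samples (false negatives), which a single-box accuracy metric would not capture.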
