Beyond Referring Expressions: Scenario Comprehension Visual Grounding
2026-04-02 • Computer Vision and Pattern Recognition
AI summary
The authors study a harder way to teach computers to find objects in pictures: understanding not just names but the situation or story around them. They created a new test called Referring Scenario Comprehension (RSC) in which computers must use clues about roles and goals instead of just matching names. The test has many examples, some with unfamiliar objects, and labels that help analyze why models fail. They also propose a new training method, ScenGround, which helps models learn better by starting easy and getting harder. Their results show this new approach exposes problems in current models that standard tests miss, and that it improves performance.
visual grounding • referring expressions • scenario-based reasoning • benchmark • reinforcement learning • curriculum learning • object roles • contextual cues • out-of-distribution • model evaluation
Authors
Ruozhen He, Nisarg A. Shah, Qihua Dong, Zilin Xiao, Jaywon Koo, Vicente Ordonez
Abstract
Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method that serves as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.