Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization

2026-06-15 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence

AI summary

The authors focus on helping robots or digital assistants recognize when they cannot answer a question from what they have seen, instead of guessing. They created a method called Semantic Flip that builds synthetic unanswerable examples by transforming the video and the question independently, so the system learns to say "I don't know" when appropriate. The method attaches to existing models without retraining them and improves refusal detection without external out-of-distribution annotations. They also provide a new benchmark called SpaceReject for checking how well systems refuse impossible location questions, on which their method reached an F1 score of 0.9559.

embodied agents, vision-language models, unanswerable queries, out-of-distribution samples, visual grounding, refusal detection, Semantic Flip, SpaceReject, spatial localization, pretrained models

Authors

Dongbin Na, Chanwoo Kim, Giyun Choi, Dooyoung Hong

Abstract

Detecting unanswerable user queries remains essential for the reliable deployment of real-world embodied agents. However, modern vision-language models (VLMs) often generate overly confident answers even when the available visual memory cannot support the query. Such overconfidence poses task-dependent risks: in Embodied Question Answering, the agent may give the user misleading information, and in spatial reasoning for navigation, it may select an arbitrary coordinate and physically guide the user there. Despite these high stakes, only a few prior studies directly address when and how an embodied VLM should respond with "I do not know." This work proposes Semantic Flip, a simple yet effective framework that synthesizes auxiliary out-of-distribution (OOD) samples for embodied refusal without requiring external OOD annotations. The key idea is to independently transform the query and video memory to construct auxiliary OOD pairs that lack sufficient visual grounding. These synthesized pairs enable training a lightweight rejection module on top of a frozen pretrained VLM. The module attaches to any existing VLM-based pipeline without retraining the underlying model. Across two complementary benchmarks, Semantic Flip consistently outperforms strong prompting baselines. This work also introduces SpaceReject, a new refusal benchmark for spatial localization with deliberately unanswerable queries over long video memory, where Semantic Flip achieves an $F_1$ score of 0.9559. The source code and datasets are publicly available at https://github.com/ndb796/SemanticFlip.
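
To make the idea of auxiliary OOD pairs concrete, the following is a minimal sketch of how answerable (query, video) pairs might be turned into labeled training data for a rejection module. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the transformations used, so this sketch substitutes the simplest stand-in, re-pairing each query with a different video. The function name `make_refusal_training_pairs`, the string video identifiers, and the 0/1 label convention are all hypothetical.

```python
import random
from typing import List, Tuple

# (query, video_id, label); label 1 means "should refuse", 0 means "answerable".
# This label convention is an assumption made for the sketch.
LabeledPair = Tuple[str, str, int]


def make_refusal_training_pairs(
    samples: List[Tuple[str, str]], seed: int = 0
) -> List[LabeledPair]:
    """Build answerable and synthetic unanswerable pairs from answerable data.

    Assumption: re-pairing a query with a different video stands in for the
    paper's (unspecified) query and video transformations. A mismatched video
    could still happen to support the query, so the labels are noisy.
    """
    videos = [video for _, video in samples]
    if len(set(videos)) < 2:
        raise ValueError("need at least two distinct videos to build mismatches")

    rng = random.Random(seed)
    # Original pairs are kept as in-distribution (answerable) examples.
    pairs: List[LabeledPair] = [(query, video, 0) for query, video in samples]
    # Each query is re-paired with a randomly chosen different video.
    for query, video in samples:
        candidates = [other for other in videos if other != video]
        pairs.append((query, rng.choice(candidates), 1))
    return pairs


if __name__ == "__main__":
    demo = [
        ("Where is the red mug?", "kitchen_01"),
        ("What color is the sofa?", "living_02"),
        ("Is the door open?", "hall_03"),
    ]
    for query, video, label in make_refusal_training_pairs(demo):
        print(label, video, query)
```

In a pipeline like the one the abstract describes, pairs of this kind would be passed through the frozen VLM and only a small classifier over its outputs would be trained; that step is omitted here because the abstract does not describe the module's inputs or architecture.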
