RescueBench: Can Embodied Agents Save Lives in the Wild?

2026-06-01 • Computer Vision and Pattern Recognition

AI summary

The authors created RescueBench, a new test that simulates search-and-rescue missions as a series of four steps: exploring an area, rescuing a target, remembering the way back, and completing the handoff. Because the steps run in sequence and are scored one by one, the benchmark shows how mistakes in early steps affect later ones, which earlier tests could not do because they checked each skill separately. The authors tested seven baseline methods, an oracle reference, and human players, and found that none of the baselines can finish the full mission at the hardest difficulty level, with exploration and memory as the main weak points. Their findings suggest that current navigation and mapping techniques are not enough to solve these challenges.

search-and-rescue, embodied agents, multimodal uncertainty, spatial memory, navigation, exploration, visual-language navigation, map-based methods, benchmark, task pipeline

Authors

Kui Wu, Beiyu Guo, Hao Chen, ShuHang Xu, Yuling Li, Yongdan Zeng, Zhoujun Li, Yizhou Wang, Fangwei Zhong

Abstract

Search-and-rescue (SAR) requires embodied agents to explore unfamiliar environments under multimodal uncertainty, perform multi-stage interactions, and retrieve spatial memory over long horizons. Existing benchmarks typically evaluate these capabilities in isolation, leaving it unclear how failures compound when the capabilities must be composed in realistic workflows. We introduce RescueBench, a photo-realistic diagnostic benchmark that instantiates SAR as a four-stage pipeline: multimodal exploration, target rescue, memory-guided return, and final handoff. By combining sequential task composition with stage-level evaluation, RescueBench enables analysis of how exploration and memory failures propagate through embodied rescue workflows. It contains five progressive difficulty levels that vary in environmental complexity, clue ambiguity, and spatial hierarchy, along with an automatic episode generation and annotation pipeline for scalable evaluation and training. We evaluate seven baselines, an oracle reference, and human players, and show that no baseline completes the full task at the highest difficulty level. Stage-level diagnosis identifies autonomous exploration as the dominant failure mode and spatial memory as a second, independent bottleneck, suggesting that these limitations are not resolved by current topological visual-language navigation or map-based methods. Code is available at https://github.com/wukui-muc/RescueBench
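
To make the idea of stage-level evaluation over a sequential pipeline concrete, here is a minimal Python sketch. It is not taken from the RescueBench code: the function name `stage_report`, the input layout, and the metric names are all assumptions made for illustration, and only the four stage names come from the abstract. The sketch counts, for each stage, how many episodes reached it and how many passed it, so that a low full-task success rate can be traced back to the stage where episodes are lost.

```python
from collections import Counter

# Stage names follow the abstract; everything else here is hypothetical.
STAGES = ["exploration", "rescue", "return", "handoff"]


def stage_report(episodes):
    """Summarise per-stage outcomes for a sequential four-stage task.

    episodes: list of per-episode outcome lists, one boolean per stage,
    given in STAGES order. An episode stops counting at its first failed
    stage, because later stages depend on earlier ones.
    """
    reached = Counter()
    passed = Counter()
    for outcome in episodes:
        for stage, ok in zip(STAGES, outcome):
            reached[stage] += 1
            if not ok:
                break  # later stages are never attempted
            passed[stage] += 1

    report = {}
    for stage in STAGES:
        n = reached[stage]
        report[stage] = {
            "reached": n,
            "passed": passed[stage],
            # Success rate among episodes that got this far.
            "conditional_success": passed[stage] / n if n else None,
        }
    # Full-task success: episodes that passed the final stage.
    full = passed[STAGES[-1]] / len(episodes) if episodes else None
    return report, full
```

Reporting the success rate conditional on reaching a stage, rather than only the end-to-end rate, is one way to separate a stage's own difficulty from failures inherited from earlier stages. Whether RescueBench computes its stage-level metrics this way is not stated in the abstract.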
