EvoMemNav: Efficient Self-Evolving Fine-Grained Memory for Zero-Shot Embodied Navigation

2026-06-02 • Computer Vision and Pattern Recognition

AI summary

The authors present EvoMemNav, a memory framework that lets a robot remember what it has seen while navigating zero-shot, that is, without task-specific training. Instead of compressing scenes into sparse scene-graph nodes or relying on expensive 3D reconstructions, the method stores raw views and organizes them with lightweight semantic labels and room-level topology, which helps the robot tell similar objects apart and decide when to stop. EvoMemNav first narrows its search to promising regions and only then calls a vision-language model to verify candidates, which keeps the cost of a growing memory under control. After each subtask, it writes what it learned back into the memory to improve future decisions without retraining. Experiments on GOAT-Bench and HM3D show higher success rates and more efficient paths, with fewer premature stops.

embodied navigation, zero-shot learning, scene graphs, 3D reconstruction, Visual-Semantic Memory Graph, multi-instance disambiguation, topological relations, vision-language models, memory update, long-horizon planning

Authors

Zuhao Ge, Xiaosong Jia, Chao Wu, Yuchen Zhou, Zuxuan Wu, Yu-Gang Jiang

Abstract

Building memory is essential for long-horizon planning in zero-shot embodied navigation. Detector-centric scene graphs often compress observations into sparse nodes, discarding fine-grained visual evidence and accumulating noise, while 3D reconstruction-based methods remain computationally prohibitive. We present EvoMemNav, an efficient, self-evolving, fine-grained memory framework for zero-shot embodied navigation. EvoMemNav constructs a Visual-Semantic Memory Graph (VSMGraph) that keeps raw views as first-class memory and organizes them with lightweight semantic cues and topological relations into a room-view-object hierarchy, preserving fine-grained details for disambiguation and Stop verification. To scale to growing memory, we introduce a budgeted coarse-to-fine policy: a coarse stage compresses the search space into promising regions, and a fine stage invokes a VLM only for targeted verification and decision. Beyond static memories, EvoMemNav performs reflection-driven write-back after each subtask, updating graph-attached priors that encode accumulated environmental knowledge to refine future decisions without retraining. Experiments on GOAT-Bench and HM3D across object, text-description, and image-goal modalities show consistent gains in SR/SPL, with better multi-instance disambiguation, fewer premature stops, and stronger zero-shot generalization.
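
The abstract does not give data structures, scoring functions, or update rules, so the following is only a minimal sketch of how a room-view-object memory with a budgeted coarse-to-fine lookup and a write-back step could fit together. Every name here (`View`, `Room`, `coarse_stage`, `fine_stage`, `write_back`, the label-match score, and the additive `prior`) is a hypothetical stand-in chosen for illustration, not the authors' implementation, and the `verify` callback stands in for a VLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class View:
    """A raw observation kept in memory (hypothetical schema)."""
    view_id: str
    image_path: str        # pointer to the stored raw frame
    objects: List[str]     # lightweight semantic cues, e.g. detected labels


@dataclass
class Room:
    """Top level of the assumed room-view-object hierarchy."""
    name: str
    views: List[View] = field(default_factory=list)
    prior: float = 0.0     # assumed graph-attached prior, changed by write_back


def coarse_stage(rooms: List[Room], goal: str, budget: int) -> List[Tuple[Room, View]]:
    """Score every view cheaply and keep at most `budget` candidates."""
    scored = []
    for room in rooms:
        for view in room.views:
            label_hit = 1.0 if goal in view.objects else 0.0
            scored.append((label_hit + room.prior, room, view))
    # Python's sort is stable, so ties keep their insertion order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(room, view) for _, room, view in scored[:budget]]


def fine_stage(
    candidates: List[Tuple[Room, View]],
    goal: str,
    verify: Callable[[View, str], bool],
) -> Optional[Tuple[Room, View]]:
    """Run the expensive verifier only on the shortlisted views."""
    for room, view in candidates:
        if verify(view, goal):
            return room, view
    return None


def write_back(hit_room: Room, step: float = 0.5) -> None:
    """Toy reflection step: raise the prior of the room that held the goal."""
    hit_room.prior += step
```

A possible usage pattern is to call `coarse_stage` with a small `budget`, pass the shortlist to `fine_stage`, and call `write_back` on the room that was confirmed, so that later shortlists rank that room's views higher. The point of the split is that the number of verifier calls is bounded by `budget` rather than by the size of the memory.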
