ConsistNav: Closing the Action Consistency Gap in Zero-Shot Object Navigation with Semantic Executive Control

2026-05-11

Robotics; Computer Vision and Pattern Recognition
AI summary

The authors address a common problem in zero-shot object navigation: even after a robot has spotted a plausible target, it keeps switching between exploring and approaching it, or gives up just before reaching it. They propose ConsistNav, a system that helps the robot stick to its plan by remembering evidence about the object across frames and managing its actions more carefully. Their method doesn’t change the object detector or the low-level movement planner; it adds a control layer that decides when to trust the object information and when to suppress or revisit it. Tests on HM3D and MP3D show that ConsistNav outperforms the compared zero-shot methods, improving SR by 11.4% and SPL by 7.9% over the controlled baseline on MP3D, and real-world deployment experiments support its robustness.

Zero-shot Object Navigation, Open-vocabulary Detectors, Semantic Executive, Finite-State Controller, Persistent Candidate Memory, Stability-Aware Action Control, HM3D Dataset, MP3D Dataset, Success Rate (SR), Success weighted by Path Length (SPL)
Authors
Haosen Wang, Zhenyang Li, Yinqiang Zhang, Zongqi He, Lutao Jiang, Kai Li, Yizhou Zhao, Liaoyuan Fan, Wenjian Hou, Tingbang Liang, Yibin Wen, Defeng Gu
Abstract
Zero-shot object navigation has advanced rapidly with open-vocabulary detectors, image-text models, and language-guided exploration. However, even after current methods detect a plausible target hypothesis, the agent may still oscillate between exploration and pursuit, or abandon the object near success. We identify this failure mode as an action consistency gap: semantic evidence is repeatedly reinterpreted at each step without persistent commitment across the episode. We introduce ConsistNav, a training-free zero-shot ObjectNav framework built around a semantic executive composed of three coordinated modules: Finite-State Executive Controller stages target pursuit through guarded semantic phases; Persistent Candidate Memory accumulates cross-frame target evidence into stable object hypotheses; and Stability-Aware Action Control suppresses rotational stagnation, ineffective pursuit, and unverified stopping. This design changes neither the detector nor the low-level planner; instead, it controls when semantic evidence should influence navigation and when it should be suppressed or revisited. We conduct extensive experiments on HM3D and MP3D, where ConsistNav achieves state-of-the-art results among compared zero-shot ObjectNav methods and improves SR by 11.4% and SPL by 7.9% over the controlled baseline on MP3D. Ablation studies and real-world deployment experiments further demonstrate the effectiveness and robustness of the proposed executive mechanism.
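
To make the idea of an action consistency gap concrete, the following is a minimal Python sketch of how a guarded finite-state executive with persistent evidence might behave differently from a per-frame policy. It is not the authors' implementation: the phase names, the exponential moving average used as a stand-in for evidence accumulation, and every threshold (`commit_at`, `release_at`, `stop_dist`, `patience`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Phase(Enum):
    EXPLORE = auto()
    PURSUE = auto()
    VERIFY = auto()
    STOP = auto()


@dataclass
class CandidateMemory:
    """Hypothetical memory: smooths per-frame scores into one hypothesis."""
    decay: float = 0.5      # assumed smoothing factor
    evidence: float = 0.0

    def update(self, score: float) -> float:
        # Exponential moving average: one missed detection lowers the
        # evidence gradually instead of erasing the hypothesis.
        self.evidence = self.decay * self.evidence + (1.0 - self.decay) * score
        return self.evidence


@dataclass
class SemanticExecutive:
    commit_at: float = 0.5   # evidence needed to start pursuit (assumed)
    release_at: float = 0.2  # evidence below which pursuit is dropped (assumed)
    stop_dist: float = 1.0   # distance in metres that triggers verification
    patience: int = 5        # pursuit steps tolerated without progress
    memory: CandidateMemory = field(default_factory=CandidateMemory)
    phase: Phase = Phase.EXPLORE
    stalled: int = 0
    best_dist: float = float("inf")

    def step(self, score: float, dist: float) -> Phase:
        ev = self.memory.update(score)
        if self.phase is Phase.EXPLORE:
            if ev >= self.commit_at:
                self.phase, self.stalled, self.best_dist = Phase.PURSUE, 0, dist
        elif self.phase is Phase.PURSUE:
            if dist < self.best_dist - 1e-6:
                self.best_dist, self.stalled = dist, 0
            else:
                self.stalled += 1
            if ev < self.release_at or self.stalled >= self.patience:
                # Ineffective pursuit: drop the candidate and explore again.
                self.phase = Phase.EXPLORE
                self.memory.evidence = 0.0
            elif dist <= self.stop_dist:
                self.phase = Phase.VERIFY
        elif self.phase is Phase.VERIFY:
            # Stop only if the accumulated evidence still supports the target.
            self.phase = Phase.STOP if ev >= self.commit_at else Phase.EXPLORE
        return self.phase


def trace(executive, scores, dists):
    """Run the executive over a sequence and return the phase names."""
    return [executive.step(s, d).name for s, d in zip(scores, dists)]


def per_frame(scores, threshold=0.5):
    """Memoryless baseline: reinterpret the detector score at every step."""
    return ["PURSUE" if s >= threshold else "EXPLORE" for s in scores]


if __name__ == "__main__":
    scores = [0.9, 0.9, 0.0, 0.9, 0.9, 0.9]   # third frame: detector dropout
    dists = [5.0, 4.0, 3.0, 2.0, 0.8, 0.8]
    print(per_frame(scores))
    print(trace(SemanticExecutive(), scores, dists))
```

With the detector dropout in the third frame, the memoryless baseline falls back to exploration for that step, whereas the sketch stays in pursuit because the accumulated evidence (0.3375) is still above the release threshold; its phase trace is EXPLORE, PURSUE, PURSUE, PURSUE, VERIFY, STOP. The gap between `commit_at` and `release_at` is what provides the hysteresis, and the `patience` counter is a simple stand-in for suppressing pursuit that makes no progress.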