Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models
2026-06-15 • Robotics
Robotics, Artificial Intelligence
AI summary
The authors developed BinTrack, a new open-source method that helps robots answer questions about locations along their path without relying on an internet connection or closed-source software. BinTrack runs a binary search over the segments of the robot's route that lie between two landmarks identified from the question. The method is more accurate and faster than previous open-source options and matches the reported closed-source result on the hardest category of the SpaceLocQA benchmark. The authors also introduce GangnamLoop, a new real-world outdoor test dataset collected with a robot on public streets under different conditions.
Spatial Question Answering, Service Robots, Egocentric Route, Binary Search, Trajectory Segments, Open-Source Models, Localization, Inference Speed, Outdoor Robot Dataset, Landmarks
Authors
Dongbin Na, Chanwoo Kim, Soonbin Rho, Giyun Choi, Gangbok Lee, Dooyoung Hong
Abstract
This work addresses spatial question answering for service robots traversing long egocentric routes. Given a query such as "where can I find a dry cleaner on the way back home?", the system returns a metric coordinate that downstream navigation components can act on. Prior spatial question answering approaches rely on retrieval-augmented agents built on closed-source models such as GPT-4o for path exploration. However, robots operating in the real world often cannot depend on online closed-source models because of network instability, communication latency, and deployment cost. This creates a need for open-source spatial question answering approaches that can run onboard the robot, yet prior research in this direction remains limited. This work proposes BinTrack, a simple yet effective, fully open-source spatial-localization agent that exploits the temporal ordering of a robot's trajectory. BinTrack performs a binary search over the trajectory segments between two anchor landmarks identified from a query. It improves overall accuracy by up to 22.8% over other open-source implementations and matches the reported closed-source model result on the global category of the SpaceLocQA benchmark, the most challenging setting, which has so far required strong reasoning agents such as GPT-4o. Furthermore, its optimized inference strategy consistently yields more than a 1.5x inference speedup over previous approaches. Finally, this work releases GangnamLoop, a novel and practical multi-trip outdoor benchmark collected by deploying a real quadruped robot on public streets under an anonymization policy. The benchmark revisits the same locations under different outdoor conditions and pairs the robot's low viewpoint with the human owner's. The source code and datasets are publicly available at https://github.com/ndb796/BinaryTracking
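
To make the core idea concrete, the following is a minimal sketch of a binary search over a time-ordered trajectory between two anchor frames. It is not the authors' implementation: the function name `binary_track`, the `Pose` type, and the `target_in_first_half` callback are all hypothetical. In a real system that callback would presumably be answered by a vision-language model inspecting the frames of a segment; here it is an assumed black-box predicate.

```python
from typing import Callable, Sequence, Tuple

Pose = Tuple[float, float]  # hypothetical (x, y) metric coordinate of a frame


def binary_track(
    poses: Sequence[Pose],
    lo: int,
    hi: int,
    target_in_first_half: Callable[[int, int, int], bool],
) -> Tuple[int, Pose]:
    """Narrow the inclusive frame range [lo, hi] down to a single frame.

    lo and hi are the frame indices of the two anchor landmarks (assumed
    to be given). target_in_first_half(lo, mid, hi) is an assumed oracle
    that returns True when the queried place lies within frames lo..mid.
    """
    if not 0 <= lo <= hi < len(poses):
        raise ValueError("anchor indices must satisfy 0 <= lo <= hi < len(poses)")
    while lo < hi:
        mid = (lo + hi) // 2
        if target_in_first_half(lo, mid, hi):
            hi = mid          # keep the earlier segment
        else:
            lo = mid + 1      # keep the later segment
    return lo, poses[lo]


# Toy usage: a straight 100-frame route with the target at frame 37.
route = [(float(i), 0.0) for i in range(100)]
calls = []


def toy_oracle(lo: int, mid: int, hi: int) -> bool:
    calls.append((lo, mid, hi))  # record each query made to the oracle
    return 37 <= mid


frame, coordinate = binary_track(route, 10, 80, toy_oracle)
```

Because each step halves the remaining segment, the number of oracle queries grows logarithmically with the number of frames between the anchors, which is one plausible reason a search of this shape can be faster than scanning the route exhaustively.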