FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

2026-04-17 · Computer Vision and Pattern Recognition · Robotics
AI summary

The authors created FineCog-Nav, a new way for drones to follow complicated instructions while flying through 3D spaces by breaking the task into smaller parts: understanding language, seeing, remembering, imagining, and deciding. Each part uses a smaller AI model specialized for that role, which helps the parts work together and makes the system easier to understand. The authors also built a new test set, AerialVLN-Fine, to better measure how well a drone follows detailed instructions over long distances. Their experiments show FineCog-Nav outperforms other zero-shot methods, especially in following instructions closely and handling new environments.

UAV, Vision-language navigation, Zero-shot learning, Modular AI, Egocentric perspective, 3D navigation, Instruction following, AerialVLN, Foundation models, Cognitive architecture
Authors
Dian Shao, Zhengzheng Xu, Peiyang Wang, Like Liu, Yule Wang, Jieqi Shi, Jing Huo
Abstract
UAV vision-language navigation (VLN) requires an agent to navigate complex 3D environments from an egocentric perspective while following ambiguous multi-step instructions over long horizons. Existing zero-shot methods remain limited, as they often rely on large base models, generic prompts, and loosely coordinated modules. In this work, we propose FineCog-Nav, a top-down framework inspired by human cognition that organizes navigation into fine-grained modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model with role-specific prompts and structured input-output protocols, enabling effective collaboration and improved interpretability. To support fine-grained evaluation, we construct AerialVLN-Fine, a curated benchmark of 300 trajectories derived from AerialVLN, with sentence-level instruction-trajectory alignment and refined instructions containing explicit visual endpoints and landmark references. Experiments show that FineCog-Nav consistently outperforms zero-shot baselines in instruction adherence, long-horizon planning, and generalization to unseen environments. These results suggest the effectiveness of fine-grained cognitive modularization for zero-shot aerial navigation. Project page: https://smartdianlab.github.io/projects-FineCogNav.
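The abstract describes a pipeline of cognitive modules, each backed by a role-specific prompt and a structured input-output protocol. The sketch below illustrates that coordination pattern only; the module names follow the abstract, but every prompt, the shared-state shape, and the module logic are placeholders invented for illustration, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Module:
    """One cognitive role with its own role-specific prompt (placeholder)."""
    name: str
    role_prompt: str  # in the real system, a prompt for a foundation model

    def __call__(self, state: dict) -> dict:
        # A real module would query a moderate-sized foundation model with
        # self.role_prompt plus the structured state; here we just record
        # that the module ran, keeping the structured-protocol shape.
        return {self.name: f"processed({state.get('instruction', '')})"}

# Hypothetical module ordering matching the roles listed in the abstract.
PIPELINE = [
    Module("language", "Parse the instruction into sub-goals."),
    Module("perception", "Describe landmarks in the egocentric view."),
    Module("attention", "Select instruction-relevant observations."),
    Module("memory", "Summarize the trajectory so far."),
    Module("imagination", "Predict the view after a candidate action."),
    Module("reasoning", "Score candidate actions against sub-goals."),
    Module("decision", "Emit the next low-level action."),
]

def navigation_step(instruction: str) -> dict:
    """Run one decision cycle: each module reads and extends a shared state."""
    state = {"instruction": instruction}
    for module in PIPELINE:
        state.update(module(state))
    return state

result = navigation_step("Fly past the red tower, then descend to the lot.")
print(sorted(k for k in result if k != "instruction"))
```

The point of the sketch is the structured hand-off: every module consumes and extends one typed state object, so each role's contribution stays inspectable, which mirrors the interpretability claim in the abstract.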