EAGLE-360: Embodied Active Global-to-Local Exploration in 360$^\circ$

2026-07-02 · Computer Vision and Pattern Recognition

AI summary

The authors found that current multimodal large language models struggle with active visual search in 360-degree panoramas because they do not model two properties of these images: severe distortion near the poles and the wrap-around (cylindrical) continuity between the left and right edges. To address this, they propose EAGLE-360, which first forms a global view of the whole panorama and then iteratively narrows the search to likely target regions instead of searching local viewpoints exhaustively. They also built a dataset of 14,000+ 4K panoramas with 70,000+ rounds of VQA dialogue and trained the model with supervised fine-tuning followed by Group Relative Policy Optimization. The authors report nearly an 8-fold increase in accuracy over the base model, along with more efficient exploration.

Multimodal Large Language Models, 360-degree panoramic images, polar distortion, cylindrical topology, RoPE Rolling positional encoding, Visual Question Answering, Supervised Fine-Tuning, Policy Optimization, Active visual search, Embodied AI
Authors
Jingtao Xu, Zizhuo Lin, Jianwen Sun, Yi Yang, Yawei Luo
Abstract
While Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in standard visual understanding, adapting them for active visual search in 360$^\circ$ panoramic environments exposes fundamental limitations. Specifically, standard MLLMs struggle to effectively model inherent panoramic properties, such as severe polar distortion and continuous cylindrical topologies, which significantly degrades target detection accuracy. Consequently, existing panoramic search methods attempt to compensate by relying heavily on fragmented local viewpoints. Burdened by rigid initialization and a lack of global panoramic priors, these approaches suffer from myopic, inefficient exploration and struggle with robust error recovery when targets are out of view. To overcome these challenges, we propose EAGLE-360, a novel Embodied Active Global-to-Local Exploration framework. Rather than performing exhaustive local searches, EAGLE-360 leverages global priors to establish an initial holistic perspective, iteratively reasoning and progressively narrowing the search space. Architecturally, we adapt RoPE Rolling, a coordinate-shifting positional encoding mechanism, to seamlessly model the continuous topologies of panoramas. To facilitate this paradigm, we construct the large-scale EAGLE-360 dataset, comprising 14,000+ 4K panoramas and 70,000+ rounds of high-quality VQA dialogues. By employing a training pipeline that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), we effectively elicit complex spatial reasoning and tool-calling capabilities. Extensive experiments demonstrate that EAGLE-360 establishes a new state-of-the-art for 360$^\circ$ visual search, achieving nearly an 8-fold increase in accuracy over the base model while significantly enhancing exploration efficiency.
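
The continuous cylindrical topology mentioned in the abstract can be illustrated with a small sketch. In an equirectangular panorama the leftmost and rightmost columns are physical neighbors, so a change in yaw corresponds to a circular shift of column coordinates. The code below is not the authors' RoPE Rolling implementation, whose details are not given in the abstract; it only shows the wrap-around arithmetic that any coordinate-shifting scheme on a panorama has to respect. All function names (`roll_columns`, `circular_distance`, `yaw_to_column`, `shift_panorama`) are hypothetical and chosen for illustration.

```python
import numpy as np


def roll_columns(width: int, shift: int) -> np.ndarray:
    """New column index of each original column after a circular shift.

    Assumption: columns are indexed 0..width-1 and wrap around modulo width.
    """
    return (np.arange(width) + shift) % width


def circular_distance(a: int, b: int, width: int) -> int:
    """Shortest horizontal distance between two columns on a cylinder."""
    d = abs(a - b) % width
    return min(d, width - d)


def yaw_to_column(yaw_deg: float, width: int) -> int:
    """Map a yaw angle in degrees to an equirectangular column index."""
    return int((yaw_deg % 360.0) / 360.0 * width) % width


def shift_panorama(pano: np.ndarray, shift: int) -> np.ndarray:
    """Circularly shift a panorama of shape (height, width, ...) along its width."""
    return np.roll(pano, shift, axis=1)


if __name__ == "__main__":
    width = 8
    pano = np.arange(width).reshape(1, width)   # toy one-row "panorama"
    rolled = shift_panorama(pano, 3)
    new_index = roll_columns(width, 3)
    # Original column i now sits at column new_index[i].
    assert (rolled[0, new_index] == pano[0]).all()
    # Columns 0 and 7 are adjacent on the cylinder, not 7 apart.
    print(circular_distance(0, 7, width))
```

On a flat image, columns 0 and `width - 1` are maximally far apart. On the cylinder they are one step apart, which is why a target that straddles the image seam looks split to a model using ordinary left-to-right position indices.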