MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

2026-06-01 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence

AI summary

The authors studied how agents that answer questions about 3D scenes use different types of data like images and point clouds. They noticed that current models only focus on one type of data, which might not always be the best choice. To fix this, they created MASER, a system that can pick the best kind of data to use for each question by learning from examples. Their tests showed that no single data type works best for every question, and MASER's smart selection improves performance by choosing the right data type each time.

3D environments, Embodied Agents, Vision-Language Models, point clouds, modality adaptation, neural routing policy, sentence transformer, Multi-layer Perceptron, Open3D-VQA benchmark

Authors

Hilton Raj, Vishnuram AV

Abstract

In 3D environments, embodied agents answer spatial questions by reasoning over a mixture of modalities, including natural language, RGB images, point clouds, depth maps, and camera poses. Existing Vision-Language Models (VLMs) are fine-tuned on a single modality, which ignores question semantics that may favor a different modality from the one used for fine-tuning. To address this, we propose MASER (Modality-Adaptive SpEcialist Routing), a lightweight framework that trains five modality adapters for a shared VLM backbone and learns a neural routing policy that selects the best adapter for each question at inference time. We encode each question with a frozen sentence transformer and pass the embedding through a small Multi-Layer Perceptron (MLP) trained on oracle adapter-accuracy labels. We evaluate our method on the Open3D-VQA benchmark and find that no single modality is universally optimal: point-cloud answers are best in 51.5% of cases. MASER routes with 51.3% oracle agreement, outperforming a Random-Forest ablation (43.5%), while requiring only a single adapter call per question.
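
The routing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the adapter names, layer sizes, random weights, and the rule for deriving oracle labels (take the adapter with the highest per-question score) are all assumptions made for illustration, and the question embedding is a placeholder vector standing in for the output of a frozen sentence transformer.

```python
import random

# Hypothetical adapter names: the abstract lists five modalities but does not name the adapters.
ADAPTERS = ["text", "rgb", "depth", "point_cloud", "camera_pose"]


def oracle_label(scores):
    """Index of the best-scoring adapter for one question (ties go to the lowest index).

    Assumption: oracle labels are the argmax over per-adapter scores.
    """
    return max(range(len(scores)), key=lambda i: scores[i])


def init_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Randomly initialised one-hidden-layer MLP (untrained; sizes are illustrative)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
        "W2": matrix(out_dim, hidden_dim), "b2": [0.0] * out_dim,
    }


def mlp_logits(embedding, params):
    """Forward pass: linear layer, ReLU, linear layer. Returns one logit per adapter."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, embedding)) + b)
        for row, b in zip(params["W1"], params["b1"])
    ]
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(params["W2"], params["b2"])
    ]


def route(embedding, params):
    """Pick a single adapter index for a question embedding (argmax over logits)."""
    logits = mlp_logits(embedding, params)
    return max(range(len(logits)), key=lambda i: logits[i])


def oracle_agreement(predicted, oracle):
    """Fraction of questions where the router's choice matches the oracle label."""
    matches = sum(1 for p, o in zip(predicted, oracle) if p == o)
    return matches / len(oracle)


if __name__ == "__main__":
    params = init_mlp(in_dim=8, hidden_dim=4, out_dim=len(ADAPTERS))
    question_embedding = [0.1] * 8  # placeholder for a sentence-transformer embedding
    print("routed to:", ADAPTERS[route(question_embedding, params)])
```

Because the router returns one index, only the selected adapter needs to be run for each question, which matches the single-adapter-call property stated in the abstract. Training the MLP on oracle labels is omitted here.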
