DAIN: Dynamic Agent-Based Interaction Network for Efficient and Collaborative Multimodal Reasoning
2026-06-29 • Computation and Language
AI summary
The authors present DAIN, a system for combining information from different sources (such as images, text, or medical data) that treats fusion as a team of specialists working together. A controller decides which expert agents each input needs, and the active agents exchange compressed messages to reach a joint decision efficiently. On five benchmarks, DAIN reports higher accuracy than prior methods, including a 2.6% gain on ADNI, and ablation studies indicate that both the dynamic scheduling and the communication between agents contribute to that result. The system is also designed to be interpretable, showing how the contribution of each expert changes with context.
Multimodal Fusion, Mixture-of-Experts (MoE), Dynamic Scheduling, Meta-Controller, Sparse Activation, Agent Communication, Multi-objective Loss, Interpretability, Benchmark Evaluation, Collaborative Reasoning
Authors
Xinxin Chen, Yuchen Li, Zihan Wang, Haoyu Zhang, Ruixin Liu, Mingyuan Zhao
Abstract
Current multimodal fusion approaches, particularly those based on static Mixture-of-Experts (MoE) architectures, often struggle to provide the adaptive and efficient collaborative reasoning required by complex real-world applications. We introduce the Dynamic Agent-based Interaction Network (DAIN), which recasts multimodal fusion as a dynamic, multi-agent collaborative process. DAIN employs a context-aware Meta-Controller that schedules sparse activation of specialized interaction agents on a per-sample basis and orchestrates compressed inter-agent communication for consensus-building. The framework is trained with a multi-objective loss that jointly optimizes task accuracy, agent specialization, and operational efficiency, the last through sparse-activation and communication regularization. Comprehensive evaluations across five diverse benchmarks (ADNI, MIMIC-IV, MM-IMDB, CMU-MOSI, and ENRICO) show that DAIN achieves state-of-the-art results, including a 2.6% accuracy gain on ADNI. Ablation studies verify the critical roles of both dynamic scheduling and agent communication. Furthermore, DAIN offers enhanced interpretability by exposing context-dependent agent roles and collaboration patterns, while sample-wise sparse agent activation keeps it computationally efficient. Our work demonstrates the promise of dynamic, agent-based paradigms for multimodal reasoning.
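The abstract does not give DAIN's equations, so the following is only an illustrative sketch of the general idea of sample-wise sparse agent scheduling with a regularized objective, not the authors' implementation. It assumes a top-k softmax gate (a common choice in MoE models), a weighted sum as the fusion step, an entropy term as the sparsity penalty, and a sum of message norms as the communication penalty. All function names, parameters, and coefficient values (`schedule_agents`, `fuse`, `total_loss`, `lam_sparse`, `lam_comm`) are hypothetical, and the agent-specialization term mentioned in the abstract is omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def schedule_agents(controller_scores, k):
    """Assumed top-k gate: keep the k highest-scoring agents for this sample
    and renormalize their softmax weights so they sum to 1."""
    probs = softmax(controller_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}

def fuse(agent_outputs, gates):
    """Assumed fusion step: gate-weighted sum of the active agents' outputs."""
    dim = len(next(iter(agent_outputs.values())))
    fused = [0.0] * dim
    for i, w in gates.items():
        for d in range(dim):
            fused[d] += w * agent_outputs[i][d]
    return fused

def total_loss(task_loss, gate_probs, message_norms,
               lam_sparse=0.01, lam_comm=0.01):
    """Hypothetical objective: task loss plus an entropy penalty on the gate
    distribution (lower entropy means sparser activation) plus a penalty on
    the size of inter-agent messages."""
    entropy = -sum(p * math.log(p) for p in gate_probs if p > 0)
    return task_loss + lam_sparse * entropy + lam_comm * sum(message_norms)

# Four candidate agents, two activated for this sample.
gates = schedule_agents([2.0, 0.5, 1.0, -1.0], k=2)
# Only the active agents (indices 0 and 2) are evaluated.
fused = fuse({0: [1.0, 0.0], 2: [0.0, 1.0]}, gates)
```

Because only the selected agents are evaluated, the cost per sample scales with `k` rather than with the total number of agents, which is the efficiency argument the abstract makes for sparse activation.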