IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

2026-06-08, Artificial Intelligence

Artificial Intelligence, Computer Vision and Pattern Recognition, Multimedia
AI summary

The authors observe that existing benchmarks for AI models that both understand and generate images and text do not test multi-turn, back-and-forth conversations. To address this, they built IMUG-Bench, a benchmark of multi-turn interleaved image-text dialogues. They used it to evaluate mainstream open-source and closed-source models and to identify where these models fail, particularly through errors that accumulate over successive turns. They also tried several test-time strategies (Chain-of-Thought, Self-Verification, and Best-of-N Sampling), which improved generation accuracy and reduced this error accumulation in longer conversations.

Unified Multimodal Models, Multi-turn Dialogue, Exposure Bias, Image-Text Interaction, Benchmark, Dynamic Understanding, Chain-of-Thought, Self-Verification, Best-of-N Sampling
Authors
Lingyi Meng, Zecong Tang, Haoran Li, Tengju Ru, Zhejun Cui, Weitong Lian, Qi Kang, Hangshuo Cao, Yichen Zhu, Yechi Liu, Kaixuan Wang, Yu-Jie Yuan, Chunwei Wang, Yu Zhang, Bo Dai
Abstract
In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this task: they are often limited to single-turn or static settings and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue that jointly evaluates the understanding and generation capabilities of UMMs. IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.
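Of the test-time scaling strategies named in the abstract, Best-of-N Sampling is the simplest to illustrate. The following is a minimal, generic sketch of the idea: draw N candidates for a turn, score each, and keep the highest-scoring one. The abstract does not describe how the authors generate or score candidates, so `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names used only for illustration, and the toy scorer stands in for whatever verifier or judge a real setup would use.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(
    generate: Callable[[int], T],
    score: Callable[[T], float],
    n: int,
) -> Tuple[T, float]:
    """Generic Best-of-N selection (illustrative sketch, not the paper's code).

    generate: hypothetical sampler; maps a seed to one candidate output.
    score:    hypothetical scorer; higher means better.
    n:        number of candidates to draw (must be >= 1).
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    # Draw n candidates, one per seed.
    candidates = [generate(seed) for seed in range(n)]

    # Score every candidate; ties are broken in favour of the earliest one.
    best_index = 0
    best_score = score(candidates[0])
    for index in range(1, n):
        current = score(candidates[index])
        if current > best_score:
            best_index, best_score = index, current

    return candidates[best_index], best_score


# Toy stand-ins (assumptions): candidates are integers, and the scorer
# prefers values close to a target of 7.
def toy_generate(seed: int) -> int:
    return seed * 3


def toy_score(candidate: int) -> float:
    return -abs(candidate - 7)


chosen, chosen_score = best_of_n(toy_generate, toy_score, n=4)
```

In a multi-turn setting, one plausible use is to call such a routine once per generation turn, so that only the selected output is carried forward as context for the next turn. Whether the authors apply it this way is not stated in the abstract.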