Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models

2026-05-25 | Machine Learning

AI summary

The authors study diffusion-based multimodal large language models, which generate text conditioned on images by filling in masked tokens at multiple positions in parallel, and ask how such a model should decide which tokens to commit at each step. They find that choosing tokens by confidence alone can commit several tokens that rely on the same image regions, a redundancy that leaves less useful visual context for later predictions. To address this, they introduce a metric for this redundancy and a decoding method that favors tokens grounded in different parts of the image. The method requires no extra training and improves accuracy on several benchmarks.

Keywords: multimodal large language models, diffusion-based decoding, token prediction, visual grounding, confidence-based decoding, Visual Redundancy Index (VRI), Visual-Redundancy-Controlled Decoding (VRCD), token-to-image attention, M3CoT, MMBench
Authors
Yulin Yuan, Hongshuo Zhao, Xiangming Meng
Abstract
Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not only which predictions are reliable in isolation, but also which positions should be committed together as context for later decoding steps. Existing confidence-based decoding ranks masked positions independently and commits the top-K positions, largely ignoring whether the committed tokens provide complementary visual grounding. We identify a step-level limitation of this strategy in multimodal settings: high-confidence tokens selected in the same step can rely on overlapping visual grounding, introducing visual redundancy among the committed tokens and leaving less complementary visual grounding available for later decoding. To quantify this effect, we introduce the Visual Redundancy Index (VRI), which measures visual grounding overlap among tokens committed in parallel. To control this redundancy during decoding, we propose Visual-Redundancy-Controlled Decoding (VRCD), a training-free inference-time decoding method that uses token-to-image attention to prioritize visually complementary positions. Across diverse multimodal benchmarks, VRCD reduces visual redundancy and remaining-position entropy with modest runtime overhead. In longer decoding experiments, it also achieves relative accuracy gains of up to 18.8% on M^3CoT and 6.9% on MMBench over confidence-based decoding. Code will be released at https://github.com/infiniteYuanyl/VRCD.
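The contrast between confidence-based top-K selection and redundancy-aware selection can be illustrated with a small sketch. The abstract does not give the formulas for VRI or VRCD, so everything below is an assumption made for illustration: overlap is approximated by cosine similarity between per-position token-to-image attention vectors, `redundancy` is a hypothetical stand-in for a VRI-like score, and `select_complementary` is a simple greedy rule with a hypothetical `penalty` weight. It is not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two attention vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def top_k_by_confidence(confidence, k):
    """Baseline: rank masked positions independently and commit the top-K."""
    order = sorted(range(len(confidence)), key=lambda i: -confidence[i])
    return order[:k]


def redundancy(attn, selected):
    """Assumed VRI-like score: mean pairwise attention overlap among committed positions."""
    pairs = [(i, j) for a, i in enumerate(selected) for j in selected[a + 1:]]
    if not pairs:
        return 0.0
    return sum(cosine(attn[i], attn[j]) for i, j in pairs) / len(pairs)


def select_complementary(confidence, attn, k, penalty=0.5):
    """Assumed greedy rule: confidence minus a penalty for overlapping with
    positions already chosen in this step. `penalty` is a hypothetical knob."""
    selected = []
    remaining = set(range(len(confidence)))
    while remaining and len(selected) < k:
        def score(i):
            overlap = max((cosine(attn[i], attn[j]) for j in selected), default=0.0)
            return confidence[i] - penalty * overlap
        best = max(sorted(remaining), key=score)  # ties go to the lowest index
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: 4 masked positions, 3 image regions. Positions 0 and 1 attend to the
# same region; positions 2 and 3 attend elsewhere.
confidence = [0.95, 0.93, 0.80, 0.60]
attn = [
    [0.90, 0.10, 0.00],
    [0.85, 0.15, 0.00],
    [0.00, 0.10, 0.90],
    [0.10, 0.80, 0.10],
]

baseline = top_k_by_confidence(confidence, k=2)
diverse = select_complementary(confidence, attn, k=2)
print(baseline, round(redundancy(attn, baseline), 3))
print(diverse, round(redundancy(attn, diverse), 3))
```

On this toy input, the baseline commits the two most confident positions even though both attend to the same image region, while the greedy rule swaps the second one for a slightly less confident position grounded elsewhere. How the actual method weighs confidence against complementarity is not stated in the abstract.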