MUNI: Multimodal Unified Latent Diffusion for Coherent Any-to-Any Generation

2026-06-15 • Machine Learning

AI summary

The authors present MUNI, a single framework that generates across data types (such as images, text, and audio), mapping any subset of input modalities to any output modalities. Unlike earlier models that require text-paired data or separate training stages, MUNI trains its encoders, decoders, and prior jointly around a shared latent space, which makes it more flexible. They also propose a training objective that keeps this shared latent coherent across modalities and informative even when only a subset of modalities is given. Their experiments show MUNI matching or exceeding the strongest baselines on conditional generation, with its largest gains in unconditional joint generation of multiple modalities.

multimodal generation, latent diffusion, any-to-any generation, variational inference, flow-based prior, latent space, modalities, conditional generation, shared representation, expressive decoders

Authors

Kyeongmin Yeo, Yunhong Min, Minhyuk Sung

Abstract

We introduce MUNI, an end-to-end multimodal latent diffusion framework for any-to-any generation that unifies subset-conditioned cross-modal generation and unconditional joint sampling through a shared stochastic latent. Existing multimodal generative models are largely LLM-based, which limits leveraging modality-specific generators and requires text-paired data for training. Recent diffusion- and flow-based any-to-any extensions take a different direction but still rely on text-aligned embeddings, fully-paired training, or matched-dimensionality deterministic mappings. MUNI rests on two complementary contributions, one architectural and one in the training objective. First, we extend latent diffusion to multimodal any-to-any generation end-to-end: instead of the standard two-stage recipe that precomputes a frozen latent space and then fits a prior over it, MUNI jointly trains modality-specific encoders, expressive decoders, and a single shared flow-based prior under one objective. Second, we identify that the standard aggregation rules of multimodal variational inference are insufficient once coupled with a learned prior and expressive decoders. A suitable shared latent must simultaneously satisfy coherence across generated modalities, predictive sufficiency of subset latents, and minimality of the latent content. We propose a routed training objective whose structural choices align the latent with these criteria and admit a minimal-sufficiency characterization in the realizable setting. Experiments on PolyMNIST-Quadrant-Labels and a large-scale image-text-audio benchmark show MUNI matching or exceeding the strongest baselines on conditional generation while opening its largest margins on unconditional coherence. Project page: https://muni-proj.github.io/.
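
For readers unfamiliar with the "standard aggregation rules of multimodal variational inference" mentioned above, the following is a minimal sketch of one commonly used rule, the product of experts over diagonal Gaussian posteriors. It is not MUNI's routed objective, and the abstract does not say which rules the authors examine. The function name `product_of_experts`, the `Gaussian` alias, and the standard-normal prior expert are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: (mean, variance) lists, one entry per latent dim.
Gaussian = Tuple[List[float], List[float]]


def product_of_experts(experts: Sequence[Gaussian], prior_var: float = 1.0) -> Gaussian:
    """Fuse per-modality diagonal Gaussians with a N(0, prior_var) prior expert.

    The product of Gaussians is Gaussian: precisions add, and the fused mean
    is the precision-weighted average of the expert means.
    """
    if not experts:
        raise ValueError("need at least one expert")
    dim = len(experts[0][0])
    # Start from the prior expert (assumed zero-mean): precision 1 / prior_var.
    precision = [1.0 / prior_var] * dim
    weighted_mean = [0.0] * dim
    for mean, var in experts:
        for d in range(dim):
            p = 1.0 / var[d]
            precision[d] += p
            weighted_mean[d] += p * mean[d]
    fused_var = [1.0 / p for p in precision]
    fused_mean = [m * v for m, v in zip(weighted_mean, fused_var)]
    return fused_mean, fused_var


# Toy one-dimensional posteriors standing in for two modality encoders.
image_expert: Gaussian = ([2.0], [1.0])
text_expert: Gaussian = ([4.0], [1.0])

# All modalities observed versus a subset (image only).
joint_mean, joint_var = product_of_experts([image_expert, text_expert])
subset_mean, subset_var = product_of_experts([image_expert])
```

Because any subset of experts can be multiplied together, a rule of this kind yields a latent posterior for every subset of observed modalities, which is the setting that subset-conditioned generation requires.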
