Illuminating Unified Multimodal Model for Free-form Interleaved Text-Image Generation

2026-06-29 • Computer Vision and Pattern Recognition

AI summary

The authors developed ILLUME-X, a new AI system that can create sequences mixing text and images together in a natural way. To do this, they improved the training data, used a special training method that adapts as the model learns, and created a way to fairly measure how well the model generates mixed text-image content. Their approach showed better results than previous models on tasks like changing image style, breaking images into parts, and telling stories with images and text. This work helps AI better handle tasks that need both pictures and words combined.

generative AI, multimodal intelligence, interleaved text-image generation, training data pipeline, progressive training, self-adaptive objectives, evaluation metrics, style transfer, image decomposition, storytelling

Authors

Chonghuinan Wang, Zhikai Chen, Chunwei Wang, Yecong Wan, Junwei Yang, Zhixin Wang, Wei Zhang, Jiaqi Xu, Renjing Pei, Xiaohe Wu, Fan Li, Wangmeng Zuo

Abstract

The advancement of generative AI models capable of producing both text and images marks a critical step forward in multimodal intelligence, particularly for tasks that interleave the two modalities. To advance this intelligence to the next stage, models must be able to autonomously generate free-form interleaved text-image sequences. In this paper, we introduce ILLUME-X, an advanced unified multimodal paradigm that enables high-quality, free-form interleaved text-image generation by improving multimodal data efficiency and stabilizing the multimodal training process. ILLUME-X comprises three key components: (i) an expanded training data pipeline optimized for interleaved text-image generation, (ii) a progressive training strategy with self-adaptive objectives for free-length multimodal token sequences, and (iii) ILScore, an objective and comprehensive evaluation method for interleaved text-image sequences. Notably, ILLUME-X outperforms previous unified models across multiple interleaved text-image generation tasks, such as style transfer, image decomposition, and storytelling.
