RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space

2026-06-12

Computer Vision and Pattern Recognition
AI summary

The authors study how large language models (LLMs) can do more in text-to-image systems than encode the prompt. In most such systems the denoising step, which turns noise into an image, is handled by a generative backbone trained from scratch. The authors show that a multimodal LLM (MLLM), which processes both text and images, can guide that step instead. Their method, RepFusion, feeds noisy visual representations into the MLLM and uses its outputs to condition a diffusion transformer. At similar inference budgets it outperforms baselines that put comparable capacity into newly initialized denoisers. The results suggest that MLLMs are strong priors for denoising visual representations, and that re-running the MLLM on the evolving noisy representations is a productive use of test-time compute.

Large Language Models, Text-to-Image Generation, Representation Autoencoders, Multimodal Large Language Models, Diffusion Models, Denoising, Latent Space, MLP Projector, Visual Representations, Test-time Computation
Authors
Xichen Pan, Aashu Singh, Satya Narayan Shukla, Xiangjun Fan, Shlok Kumar Mishra, Saining Xie
Abstract
Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers. These results demonstrate that MLLMs provide strong priors for denoising visual representations and that, by conditioning on evolving noisy representations, test-time compute can be productively spent on repeated MLLM conditioning in modern T2I systems.
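The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name, tensor size, and weight below is hypothetical, the "MLLM" and "diffusion transformer" are random single-layer stand-ins, and the plain Euler update is an assumption, since the abstract does not specify a sampler. The sketch only shows the conditioning pattern: noisy representations pass through an MLP projector into the MLLM, and the MLLM outputs condition the denoiser again at every step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: visual tokens, RAE latent width, LLM hidden width.
N_TOKENS, D_VIS, D_LLM = 16, 32, 64


def init_linear(d_in, d_out):
    """Random weights standing in for trained or pretrained parameters."""
    return rng.normal(0.0, d_in ** -0.5, size=(d_in, d_out))


W1, W2 = init_linear(D_VIS, D_LLM), init_linear(D_LLM, D_LLM)  # MLP projector
W_MLLM = init_linear(D_LLM, D_LLM)                             # stand-in MLLM
W_X, W_C = init_linear(D_VIS, D_VIS), init_linear(D_LLM, D_VIS)  # stand-in DiT


def mlp_projector(z):
    """Map visual representations into the LLM embedding space."""
    return np.maximum(z @ W1, 0.0) @ W2


def mllm_encode(text_emb, z_noisy):
    """Encode prompt tokens plus projected noisy visual tokens.

    Mean pooling is a crude stand-in for attention so that the prompt
    influences the visual positions; a real MLLM is a full transformer.
    """
    tokens = np.concatenate([text_emb, mlp_projector(z_noisy)], axis=0)
    hidden = np.tanh(tokens @ W_MLLM + tokens.mean(axis=0))
    return hidden[-z_noisy.shape[0]:]  # outputs at the visual positions


def dit_velocity(z_noisy, t, cond):
    """Stand-in denoiser: predicts an update from the latents, time, and condition."""
    return np.tanh(z_noisy @ W_X + cond @ W_C + t)


def sample(text_emb, steps=8):
    """Euler sampling loop that re-runs the MLLM on the evolving noisy latents."""
    z = rng.normal(size=(N_TOKENS, D_VIS))  # start from pure noise
    for i in range(steps):
        t = i / steps
        cond = mllm_encode(text_emb, z)  # conditioning is recomputed every step
        z = z + dit_velocity(z, t, cond) / steps
    return z


text_emb = rng.normal(size=(4, D_LLM))  # hypothetical prompt embeddings
z_final = sample(text_emb)
print(z_final.shape)
```

In this sketch, raising `steps` increases the number of MLLM forward passes, which is the sense in which test-time compute is spent on repeated MLLM conditioning.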