MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

2026-06-15
Computer Vision and Pattern Recognition

AI summary

The authors introduce MMDiff, a method that takes a diffusion transformer (a model that creates images step by step) and reuses its internal representations to generate not only images but also related outputs such as depth maps or segmentation masks. They find that combining features from multiple denoising steps substantially improves tasks such as semantic segmentation, by up to 28.7% mIoU over using a single step. By adding small, task-specific decoder heads to the frozen main model, they obtain strong results across several vision tasks without retraining the whole system. The approach also supports generating synthetic training data.

Diffusion transformer, Denoising trajectory, Multi-modal generation, Semantic segmentation, Decoder head, Feature fusion, Concept-driven attention, Synthetic data generation, Depth estimation, Salient object detection
Authors
Yagmur Akarken, Orest Kupyn, Christian Rupprecht
Abstract
Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.
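
To make the idea of multi-timestep fusion with spatially varying aggregation weights concrete, the following is a minimal sketch and not the authors' implementation. It assumes features from T denoising steps are stacked in a (T, H, W, C) array and that each pixel has one score per timestep, normalised with a softmax over timesteps. The function name `fuse_timesteps`, the array layout, and the random inputs are all illustrative assumptions; in practice the scores would presumably be predicted by a trained module rather than sampled.

```python
import numpy as np


def fuse_timesteps(features, weight_logits):
    """Fuse per-timestep feature maps with per-pixel softmax weights.

    features:      (T, H, W, C) features from T denoising steps (assumed layout)
    weight_logits: (T, H, W) unnormalised per-pixel scores, one per timestep
    returns:       (H, W, C) fused feature map
    """
    # Softmax over the timestep axis so the weights at each pixel sum to 1.
    shifted = weight_logits - weight_logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    # Weighted sum over timesteps; weights broadcast across the channel axis.
    return (weights[..., None] * features).sum(axis=0)


# Hypothetical sizes and random stand-in data, for illustration only.
rng = np.random.default_rng(0)
T, H, W, C = 4, 8, 8, 16
features = rng.standard_normal((T, H, W, C))
logits = rng.standard_normal((T, H, W))
fused = fuse_timesteps(features, logits)
```

With all scores equal, this reduces to a plain average over timesteps. Single-timestep extraction corresponds to the limiting case where one timestep receives all the weight at every pixel.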