Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation

2026-05-25
Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence, Graphics, Machine Learning, Multimedia
AI summary

The authors address the challenge of creating new images that keep the identity of a subject while following text instructions. They identify issues with previous methods that handle text and images separately, causing problems like copy-paste artifacts and weak understanding. To fix this, they use a combined model that processes both text and images together and add a way to better preserve identity details. They also introduce a special module and a step-by-step method to improve how these signals work together. Their experiments show that their method produces images that better balance following instructions and keeping identity.

subject-driven image generation, diffusion models, Multimodal Large Language Models (MLLMs), variational autoencoders (VAE), identity preservation, cross-modal reasoning, Dual Layer Aggregation (DLA), multi-stage denoising, copy-paste artifacts
Authors
Shuhong Zheng, Aashish Kumar Misraa, Yu-Teng Li, Yu-Jhe Li, Igor Gilitschenski
Abstract
Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instructions. Existing approaches often encode text and reference images separately. This limits cross-modal reasoning abilities and causes copy-paste artifacts. Recent frameworks that connect multimodal models and diffusion models improve instruction following, but largely overlook identity preservation. To address these limitations, we condition diffusion models on Multimodal Large Language Models (MLLMs) that jointly encode text and reference images, and augment this conditioning with VAE-based identity conditioning. A novel Dual Layer Aggregation (DLA) module is designed to aggregate multi-level MLLM features for optimal conditioning, and a multi-stage denoising strategy is applied to progressively balance the semantic information from the MLLM and the fine-detail identity from the VAE during inference. Extensive experiments demonstrate that our approach harmonizes multimodal understanding with identity preservation, mitigates copy-paste issues, and achieves superior performance regarding human preference on subject-driven image generation. Our project website is available at https://zsh2000.github.io/squeeze-mllm-subject-gen/.
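
The multi-stage denoising idea can be pictured as a schedule that changes how much each condition contributes as sampling progresses. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how many stages are used, where their boundaries lie, or which condition dominates at which point. The `Stage` class, the `condition_weights` function, the three-stage split, and all numeric weights are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Stage:
    """One contiguous segment of the denoising trajectory (hypothetical)."""
    end_fraction: float  # stage covers progress up to this fraction of all steps
    mllm_weight: float   # weight on the MLLM (semantic) condition
    vae_weight: float    # weight on the VAE (fine-detail identity) condition


# Illustrative values only; none of these numbers come from the paper.
STAGES = (
    Stage(end_fraction=0.3, mllm_weight=1.0, vae_weight=0.0),
    Stage(end_fraction=0.7, mllm_weight=1.0, vae_weight=0.5),
    Stage(end_fraction=1.0, mllm_weight=0.5, vae_weight=1.0),
)


def condition_weights(
    step: int, num_steps: int, stages: Sequence[Stage] = STAGES
) -> Tuple[float, float]:
    """Return (mllm_weight, vae_weight) for a 0-indexed denoising step."""
    if not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    # Fraction of the trajectory completed once this step has run.
    progress = (step + 1) / num_steps
    for stage in stages:
        if progress <= stage.end_fraction:
            return stage.mllm_weight, stage.vae_weight
    # Fallback for schedules whose last stage ends before 1.0.
    return stages[-1].mllm_weight, stages[-1].vae_weight


if __name__ == "__main__":
    for step in range(10):
        print(step, condition_weights(step, 10))
```

In a sampling loop, the two returned weights would scale the respective conditioning signals before each denoiser call. Whether the actual method reweights, switches, or drops conditions per stage is not stated in the abstract.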