Equilibrated Diffusion: Frequency-aware Textual Embedding for Equilibrated Image Customization

2026-06-01
Computer Vision and Pattern Recognition

AI summary

The authors present Equilibrated Diffusion, a new method for image customization that better separates different image features like subject content and style by looking at image frequencies. Instead of mixing all features into one embedding, they break the image into low and high frequency parts and optimize each separately, helping the model keep the subject clear while changing styles more accurately. They also use a mask to avoid unwanted background changes and add a special attention module to keep the subject's identity. Their experiments show this approach improves how well the generated images match the subject and the text descriptions compared to other methods.

image customization, frequency decomposition, latent embedding, diffusion models, image denoising, mask-guided diffusion, spatial attention, style transfer, subject fidelity, text-visual alignment
Authors
Liyuan Ma, Xueji Fang, Guo-Jun Qi
Abstract
Image customization learns a target subject from reference concept images and generates images conditioned on text prompts, mainly by modifying style or background. Prevailing methods use fine-tuning to pack diverse concept attributes into a single latent embedding, but the entangled attributes make it hard to remove irrelevant interference from style and background. To address this issue, we propose Equilibrated Diffusion, a frequency-driven approach that disentangles concept features for balanced customization and consistent text-visual alignment. Unlike conventional methods that learn a full concept with a shared embedding and unified tuning, our work exploits the inherent link between image frequency components and semantics: low frequencies represent subject content, while high frequencies correspond to style. We decompose concepts in frequency space and optimize each embedding independently. This separate optimization enables the denoiser to capture style detached from subject identity and to generalize better to unseen stylistic prompts. Merging the multi-frequency embeddings preserves the model's original spatial customization ability. We further deploy mask-guided diffusion to restrict irrelevant background changes and improve text alignment. Residual Reference Attention (RRA) is inserted into spatial attention to retain subject structure and identity consistency. Experiments show that Equilibrated Diffusion outperforms mainstream baselines in subject fidelity and text adherence.
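The abstract does not say how the frequency decomposition is carried out, so the following is only a minimal sketch of one common way to split an image into low- and high-frequency components. It assumes a 2-D grayscale image, a hard circular low-pass mask in the Fourier domain, and NumPy; the function name `split_frequencies` and the `cutoff` parameter are hypothetical and not taken from the paper.

```python
import numpy as np


def split_frequencies(image: np.ndarray, cutoff: float = 0.1):
    """Split a 2-D grayscale image into low- and high-frequency parts.

    `cutoff` is an illustrative radius in cycles per pixel (an assumption,
    not a value from the paper). Frequencies at or below it go to `low`.
    """
    h, w = image.shape
    # Frequency coordinate of every FFT bin, in cycles per pixel.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx**2 + fy**2)

    # Ideal (hard-edged) circular low-pass mask; it includes the DC bin.
    low_pass = radius <= cutoff

    spectrum = np.fft.fft2(image)
    low = np.fft.ifft2(spectrum * low_pass).real
    # Take the high band as the residual so that low + high == image.
    high = image - low
    return low, high


# Example usage on a random test image (a stand-in for a reference image).
rng = np.random.default_rng(0)
image = rng.random((32, 32))
low, high = split_frequencies(image, cutoff=0.1)
```

Under the abstract's reading, `low` would carry coarse subject content and `high` would carry fine detail associated with style. Defining the high band as the residual keeps the split lossless, so the two parts can be recombined exactly.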