Redirecting the Flow: Image Customization through Attention Distribution Shift

2026-06-15 • Computer Vision and Pattern Recognition

AI summary

The authors address subject-driven image customization: generating images that follow a text description while preserving the identity of a given reference subject, such as a particular person or object. They point out that existing methods are inefficient, extract reference features that are poorly aligned with the generation process, and suffer interference from irrelevant information. To fix this, they treat customization as a shift in the model's attention distribution caused by adding reference images to text-to-image generation, and propose CustomShift, a dual-branch architecture built on Stable Diffusion 3 that aligns reference images with text cues. Their experiments on the DreamBooth and Custom101 benchmarks show that CustomShift better preserves the subject's identity while still matching the text.

Subject-driven image customization, Text-to-image generation, Stable Diffusion, Attention mechanism, Self-attention, Maximum entropy theory, Distribution shift, DreamBooth, Semantic fidelity, Latent representations

Authors

Jie Li, Suorong Yang, Jian Zhao, Furao Shen

Abstract

Subject-driven image customization aims to generate images that not only follow textual instructions but also preserve the identity of a given reference subject. Existing approaches, including test-time fine-tuning, encoder-based methods, and token competition in shared attention spaces, suffer from limited efficiency, misalignment between extracted reference features and the generative process, and interference from irrelevant information. To address these limitations, we formulate the customization task as a distribution shift induced by incorporating reference images into text-to-image generation, and derive a Conditional Attention Distribution Shift formulation grounded in maximum entropy theory. Building on this formulation, we propose CustomShift, a dual-branch architecture based on Stable Diffusion 3. The Reference-Alignment Branch leverages self-attention between reference images and subject names to achieve layer-wise alignment with latent representations, while the Cross-Guidance Branch integrates textual and reference cues to guide generation. Experiments on the DreamBooth and Custom101 benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches, achieving a better balance between semantic fidelity and subject consistency.
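
The abstract does not state the Conditional Attention Distribution Shift formulation itself, so the following is only a generic illustration of the underlying idea, not the authors' method. It relies on a standard property of softmax attention: when extra keys (here, hypothetical reference-image tokens) are appended to the existing text keys, the attention over the original keys keeps its shape but is scaled down by the share of probability mass the new keys absorb. All names (`attention_shift`, `text_keys`, `ref_keys`) and the random inputs are assumptions made for this sketch.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention_shift(query, text_keys, ref_keys):
    """Attention over text keys alone, and over text + reference keys jointly.

    Illustrative only: a single query vector with scaled dot-product scores.
    """
    scale = 1.0 / np.sqrt(query.shape[-1])
    text_scores = text_keys @ query * scale
    ref_scores = ref_keys @ query * scale
    text_only = softmax(text_scores)
    joint = softmax(np.concatenate([text_scores, ref_scores]))
    return text_only, joint

# Hypothetical toy inputs: 4 text tokens, 3 reference tokens, dimension 8.
rng = np.random.default_rng(0)
query = rng.normal(size=8)
text_keys = rng.normal(size=(4, 8))
ref_keys = rng.normal(size=(3, 8))

text_only, joint = attention_shift(query, text_keys, ref_keys)

# Probability mass that stays on the text tokens after the reference is added.
text_mass = joint[:4].sum()
print("attention mass on text tokens:", text_mass)
print("attention mass on reference tokens:", 1.0 - text_mass)
```

Renormalizing the text portion of the joint distribution (`joint[:4] / text_mass`) recovers the text-only distribution, which is one simple way to see adding a reference as a shift of an existing attention distribution rather than a replacement of it.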
