AI-T2I: Aggregating-and-Isolating Cross-Attention to Diffusion Models for Text-to-Image Synthesis
2026-05-25 • Computer Vision and Pattern Recognition
AI summary
The authors address a problem in text-to-image generation where the model's attention to the words doesn't line up well with the image details. They identify that the model spreads attention too broadly for the same subject and mixes attention between different subjects. To fix this, they introduce two new losses: one that gathers scattered attention for each subject and another that keeps different subjects' attention separate. Their method, called AI-T2I, improves how well images match the text and works well on other related tasks too.
text-to-image synthesis, diffusion models, cross-attention, denoising process, attention mechanisms, aggregation loss, isolation loss, text-to-image alignment, generative models, layout generation
Authors
Shipeng Cao, Biao Qian, Haipeng Liu, Yang Wang, Meng Wang
Abstract
Text-to-image synthesis has made significant progress, benefiting from the strong generative capabilities of diffusion models. However, these models struggle to achieve precise text-to-image alignment within cross-attention maps during the denoising process. Existing works primarily focus on the overlap of inter-subject-token activations (i.e., cross-attention scores) between different subjects, overlooking the scattering of intra-subject-token activations for the same subject. In this paper, we propose an Aggregating-and-Isolating cross-attention approach to diffusion models for Text-to-Image synthesis, dubbed AI-T2I. Technically, to address the scattering issue, we devise an aggregation loss that identifies and consolidates the scattered intra-token activations, which implicitly helps mitigate the potential overlap issue. On top of that, an isolation loss is introduced to push the inter-token activations apart, thus achieving precise text-to-image alignment. Extensive experiments on various benchmarks demonstrate the superiority of AI-T2I over state-of-the-art methods for text-to-image synthesis. Furthermore, AI-T2I generalizes well to other tasks, e.g., controllable layout generation and personalized generation.
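The abstract does not give the form of the two losses, so the following is only a rough illustration of the idea, not the authors' formulation. It assumes each subject token has a 2D non-negative cross-attention map, uses spatial variance around the map's centroid as a stand-in for an aggregation objective, and uses the summed elementwise minimum of two normalized maps as a stand-in for an isolation objective. The function names `aggregation_loss` and `isolation_loss` are hypothetical.

```python
import numpy as np


def normalize(attn):
    """Scale a non-negative attention map so that it sums to 1."""
    attn = np.asarray(attn, dtype=np.float64)
    total = attn.sum()
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    return attn / total


def aggregation_loss(attn):
    """Assumed stand-in: spatial variance of one token's map around its centroid.

    Small when the activations form one compact region, large when they
    are scattered across the image.
    """
    p = normalize(attn)
    h, w = p.shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy = (p * ys).sum()  # attention-weighted centroid row
    cx = (p * xs).sum()  # attention-weighted centroid column
    return float((p * ((ys - cy) ** 2 + (xs - cx) ** 2)).sum())


def isolation_loss(attn_a, attn_b):
    """Assumed stand-in: overlap between two tokens' normalized maps.

    0 when the maps are disjoint, 1 when they are identical.
    """
    return float(np.minimum(normalize(attn_a), normalize(attn_b)).sum())


# Toy 4x4 maps: one compact blob, one scattered over the four corners.
compact = np.zeros((4, 4))
compact[0:2, 0:2] = 1.0
scattered = np.zeros((4, 4))
scattered[[0, 0, 3, 3], [0, 3, 0, 3]] = 1.0
other = np.zeros((4, 4))
other[2:4, 2:4] = 1.0

print(aggregation_loss(compact), aggregation_loss(scattered))
print(isolation_loss(compact, other), isolation_loss(compact, compact))
```

In this toy setup the scattered map scores higher on the aggregation term than the compact one, and the two disjoint blobs score zero on the isolation term. Applying such terms during denoising would require a differentiable implementation inside the diffusion pipeline, which this sketch does not attempt.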