Intermediate Text Representation Guided Text-to-Image Generation for Enhancing One-and-Only Alignment

2026-06-29 • Computer Vision and Pattern Recognition

AI summary

The authors study why text-to-image models often fail to follow prompts that ask for an unusual version of an object with only one usual appearance, such as a famous landmark or artwork. They find that concept information present in the intermediate layers of the text encoder is lost in the final text embedding, which makes these images hard to change through prompting alone. To fix this, they propose a method that feeds intermediate text representations into the early steps of image generation, recovering the missing details without extra training. They also create a new test set of prompts that contradict the usual appearance of these unique objects, and report an improvement of up to 19.1 percentage points in VQAScore.

text-to-image diffusion models, concept association bias, one-and-only (OAO) objects, text embedding, intermediate text representations, denoising process, mutual information, prompting, VQAScore, counterfactual prompts

Authors

Soyoun Won, Aryan Yazdan Parast, Basim Azam, Jean Honorio, Naveed Akhtar

Abstract

Text-to-image (T2I) diffusion models often fail to faithfully render explicit textual descriptions, instead defaulting to strongly learned visual priors due to a phenomenon referred to as concept association bias. We show that such bias is particularly strong for one-and-only (OAO) objects, entities that exist in a single canonical form, such as celestial bodies, landmarks, and artworks. The deeply ingrained visual identity for these concepts often resists modification through prompting alone. Addressing this challenge, we first identify through an information-theoretic analysis that the final text embedding discards concept-level information present in the intermediate-layer text representations, reducing the mutual information available to the subsequent denoising process. We then propose Intermediate Text Representation (IR)-guided diffusion, which injects intermediate hidden states of the text encoder into the conditioning signal during early denoising steps, recovering suppressed concepts without any additional training, optimization, or external models. To systematically evaluate the challenging task of aligning generative outputs with unusual prompts for OAO objects, we introduce OAO-AttackBench, a benchmark comprising counterfactual prompts that directly conflict with the core visual identity of OAO objects. Experiments on four benchmarks, including OAO-AttackBench, show that our method achieves up to a 19.1 percentage-point improvement in VQAScore while preserving generation fidelity and human preference. Project page: https://soyoun-won.github.io/one-and-only-ir-guidance/.
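
The scheduling idea behind IR-guided diffusion can be illustrated with a minimal sketch. The abstract does not say how the intermediate hidden states are combined with the final embedding, which layer is used, or how many steps count as "early", so everything below is an assumption made for illustration: the function names `blend` and `conditioning_for_step`, the convex-combination rule, and the `num_early_steps` and `weight` parameters are hypothetical and are not the authors' implementation. Embeddings are plain lists of floats standing in for text-encoder outputs.

```python
from typing import List, Sequence

Vector = List[float]


def blend(final: Sequence[Vector], intermediate: Sequence[Vector],
          weight: float) -> List[Vector]:
    """Token-wise convex combination of two embedding sequences.

    ASSUMPTION: a weighted average is only one possible way to "inject"
    intermediate states; the abstract does not specify the rule.
    """
    if len(final) != len(intermediate):
        raise ValueError("both sequences must have the same number of tokens")
    return [
        [(1.0 - weight) * f + weight * h for f, h in zip(f_tok, h_tok)]
        for f_tok, h_tok in zip(final, intermediate)
    ]


def conditioning_for_step(step: int,
                          final: Sequence[Vector],
                          intermediate: Sequence[Vector],
                          num_early_steps: int = 3,
                          weight: float = 0.5) -> List[Vector]:
    """Pick the text conditioning for one denoising step (0 = noisiest).

    Early steps get the blended signal; later steps fall back to the
    final text embedding alone. Both hyperparameters are hypothetical.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < num_early_steps:
        return blend(final, intermediate, weight)
    # Return a copy so callers cannot mutate the original embedding.
    return [list(tok) for tok in final]


if __name__ == "__main__":
    final_emb = [[1.0, 0.0]]   # stand-in for the last-layer embedding
    inter_emb = [[0.0, 2.0]]   # stand-in for an intermediate hidden state
    for t in range(5):
        print(t, conditioning_for_step(t, final_emb, inter_emb))
```

In a real pipeline the two inputs would be tensors taken from the text encoder, and the returned conditioning would be passed to the denoiser at each step. Because only the conditioning changes, a schedule of this kind needs no additional training, which matches the training-free claim in the abstract.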
