Guiding a Diffusion Model by Swapping Its Tokens

2026-04-09 • Computer Vision and Pattern Recognition

AI summary

The authors introduce Self-Swap Guidance (SSG), an inference-time technique that improves image quality in both conditional and unconditional diffusion models. The method swaps pairs of the model's most semantically dissimilar token latents to produce a perturbed prediction, then uses the direction between the perturbed and clean predictions to steer sampling toward higher-fidelity images. Unlike previous methods that perturb globally or with few constraints, SSG selectively exchanges token latents, which gives finer control over the perturbation. In experiments on MS-COCO 2014, MS-COCO 2017, and ImageNet, SSG outperforms earlier condition-free guidance methods in image fidelity and prompt alignment, and it produces fewer side effects across a wider range of perturbation strengths. It can be inserted into existing diffusion models as a plug-in.

Classifier-Free Guidance, Diffusion Models, Unconditional Generation, Conditional Generation, Token Latents, Image Fidelity, Prompt Alignment, MS-COCO, ImageNet, Sampling

Authors

Weijia Zhang, Yuehao Liu, Shanyan Guan, Wu Ran, Yanhao Ge, Wei Li, Chao Ma

Abstract

Classifier-Free Guidance (CFG) is a widely used inference-time technique to boost the image quality of diffusion models. Yet, its reliance on text conditions prevents its use in unconditional generation. We propose a simple method to enable CFG-like guidance for both conditional and unconditional generation. The key idea is to generate a perturbed prediction via simple token swap operations, and use the direction between it and the clean prediction to steer sampling towards higher-fidelity distributions. In practice, we swap pairs of the most semantically dissimilar token latents in either the spatial or the channel dimension. Unlike existing methods that apply perturbation in a global or less constrained manner, our approach selectively exchanges and recomposes token latents, allowing finer control over the perturbation and its influence on generated samples. Experiments on the MS-COCO 2014, MS-COCO 2017, and ImageNet datasets demonstrate that the proposed Self-Swap Guidance (SSG), when applied to popular diffusion models, outperforms previous condition-free methods in image fidelity and prompt alignment under different setups. Its fine-grained perturbation granularity also improves robustness, reducing side effects across a wider range of perturbation strengths. Overall, SSG extends CFG to a broader scope of applications including both conditional and unconditional generation, and can be readily inserted into any diffusion model as a plug-in to gain immediate improvements.
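The abstract describes two steps: swapping pairs of the most dissimilar token latents, and extrapolating from the perturbed prediction toward the clean one. The NumPy sketch below illustrates both under stated assumptions and is not the authors' implementation. Cosine similarity as the dissimilarity measure, the greedy pairing, the CFG-style extrapolation formula, and all function names (`most_dissimilar_pairs`, `swap_tokens`, `guided_prediction`) are assumptions made for illustration. Only the spatial variant is shown, where tokens are rows of a `(num_tokens, channels)` array.

```python
import numpy as np

def most_dissimilar_pairs(tokens, num_pairs):
    """Greedily pick disjoint token pairs with the lowest cosine similarity.

    Assumption: cosine similarity stands in for "semantic" similarity.
    """
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    np.fill_diagonal(sim, np.inf)  # a token is never paired with itself
    pairs = []
    for _ in range(num_pairs):
        i, j = np.unravel_index(np.argmin(sim), sim.shape)
        if not np.isfinite(sim[i, j]):
            break  # no unpaired tokens left
        pairs.append((int(i), int(j)))
        # Remove both tokens from further consideration.
        sim[[i, j], :] = np.inf
        sim[:, [i, j]] = np.inf
    return pairs

def swap_tokens(tokens, pairs):
    """Return a copy of `tokens` with each selected pair of rows exchanged."""
    out = tokens.copy()
    for i, j in pairs:
        out[[i, j]] = tokens[[j, i]]
    return out

def guided_prediction(pred_clean, pred_perturbed, scale):
    """Assumed CFG-style extrapolation away from the perturbed prediction."""
    return pred_clean + scale * (pred_clean - pred_perturbed)

# Toy example: four 2-D tokens forming two opposite-direction pairs.
tokens = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
pairs = most_dissimilar_pairs(tokens, num_pairs=2)
swapped = swap_tokens(tokens, pairs)
```

In an actual sampler, the swap would presumably be applied to token latents inside the denoising network, a second forward pass would yield the perturbed prediction, and `guided_prediction` would combine the two at each sampling step. With `scale = 0` the clean prediction is returned unchanged.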