Structure-Semantic Co-optimized Latent Diffusion Model for Fast Visual Anagram Synthesis
2026-06-15 • Computer Vision and Pattern Recognition
AI summary
The authors present a new method to create visual anagrams—images that change meaning when flipped or rotated—more quickly and with better image quality than before. They improve existing text-to-image models by combining faster denoising techniques with a new framework called S2CO, which helps keep both the image structure and meaning clear. Their approach produces higher-resolution images that look nicer and make more sense, without requiring a lot of extra computing power. This work aims to make illusionary digital art easier and faster to create.
Keywords: visual anagram, text-to-image (T2I) diffusion models, denoising algorithm, latent-based models, structure-semantic co-optimization (S2CO), null-text alignment, semantic enhancement, attention-guided noise fusion, image resolution, inference speed
Authors
Xiang Gao, Yunpeng Jia
Abstract
A visual anagram is an intriguing form of art in which a single image presents different conceptual interpretations under transformations such as flipping or rotation. Recent work has achieved visual anagram synthesis by leveraging pretrained text-to-image (T2I) diffusion models, yet it still suffers from several key limitations, including computational inefficiency, suboptimal aesthetic quality, and weak semantic fidelity and expressiveness. This work focuses on generating visual anagrams with substantially improved visual quality at minimal computational cost, thereby advancing intelligent creation of illusionary digital art. To increase image resolution while reducing time overhead, we adapt the cutting-edge parallel denoising algorithm from a pixel-based T2I model to an adversarially distilled latent-based one, and accordingly propose a structure-semantic co-optimization (S2CO) framework to counteract the consequent visual degradation. As the core of our approach, the S2CO framework comprises three key innovations: (i) null-text structure alignment optimization; (ii) semantic enhancement optimization; (iii) attention-guided noise fusion. Building upon these components, our method, dubbed S2CO-Anagram, generates higher-resolution anagram images with noticeably better visual harmony and semantic faithfulness than related SOTA approaches, while achieving substantially faster inference. Code will be publicly available.
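The abstract does not spell out the parallel denoising step, so the following is only a minimal sketch of the general idea as it is commonly described for visual anagrams: at each denoising step, the noisy image is shown to the model under every view, each with its own prompt, and the per-view noise estimates are mapped back to the canonical orientation and combined. Everything here is an assumption for illustration: `dummy_noise_predictor` is a hypothetical stand-in for a real diffusion model, the view set is just identity and 180-degree rotation, and the estimates are combined by plain averaging rather than by the attention-guided noise fusion the authors propose.

```python
import numpy as np


def identity(x):
    return x


def rot180(x):
    # Rotate the last two (spatial) axes by 180 degrees; this view is its own inverse.
    return np.rot90(x, k=2, axes=(-2, -1))


# Each view is a (forward, inverse) pair of transformations (illustrative choice).
VIEWS = [(identity, identity), (rot180, rot180)]


def dummy_noise_predictor(x, prompt_id):
    # Hypothetical stand-in for a text-conditioned denoiser:
    # a real model would take the noisy input, a timestep, and a prompt embedding.
    return 0.1 * (prompt_id + 1) * x


def fused_noise_estimate(x_t, views, predictor):
    """Estimate noise under every view, undo each view, and average the results."""
    estimates = []
    for prompt_id, (forward, inverse) in enumerate(views):
        eps = predictor(forward(x_t), prompt_id)  # noise estimate in the transformed frame
        estimates.append(inverse(eps))            # map back to the canonical frame
    return np.mean(estimates, axis=0)


x_t = np.arange(16, dtype=float).reshape(1, 4, 4)
eps = fused_noise_estimate(x_t, VIEWS, dummy_noise_predictor)
print(eps.shape)
```

Because the per-view estimates are independent of one another, they can be batched into a single forward pass, which is what makes this scheme "parallel"; how the paper replaces the simple mean with attention-guided fusion is not described in the abstract.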