LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition

2026-03-18

Computer Vision and Pattern Recognition
AI summary

The authors present LaDe, a system that creates detailed, layered design documents like posters or logos just from text descriptions without limiting the number of layers or forcing layers to be continuous in space. LaDe uses three parts: a language model that breaks down a simple prompt into detailed layer instructions, a special diffusion model that generates both the whole design and its separate layers at once, and a tool that accurately decodes each layer with transparency. Their method can handle making images from text, making layered designs from text, and breaking down designs into layers. They show that LaDe performs better than a previous system by matching text instructions to layers more closely on a design test set.

Latent Diffusion · Layered Media Design · RGBA Layers · Large Language Model (LLM) · Positional Encoding · VAE (Variational Autoencoder) · Text-to-Image Generation · Design Decomposition · Multimodal Learning
Authors
Vlad-Constantin Lungu-Stan, Ionut Mironica, Mariana-Iuliana Georgescu
Abstract
Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).
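The 4D RoPE mechanism mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it simply assumes that, as in standard multi-axis RoPE, the attention head dimension is split into four equal chunks and each chunk is rotated by the token's position along one of four axes (hypothetically: layer index plus spatial/sequence coordinates), so that attention scores depend only on relative positions along each axis.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    # Standard 1D RoPE: rotate feature pairs of x by angles
    # derived from the scalar position `pos`.
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def rope_4d(x, positions):
    # x: (..., head_dim) query/key vector; positions: 4 ints, one per axis.
    # Split the head dimension into 4 chunks and apply 1D RoPE per axis.
    chunks = np.split(x, 4, axis=-1)
    return np.concatenate(
        [rope_rotate(c, p) for c, p in zip(chunks, positions)], axis=-1
    )

# Key RoPE property: q·k after rotation depends only on the
# per-axis position *differences*, so shifting all positions
# equally leaves the attention score unchanged.
rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)
s1 = rope_4d(q, (1, 2, 3, 4)) @ rope_4d(k, (0, 1, 2, 3))
s2 = rope_4d(q, (5, 6, 7, 8)) @ rope_4d(k, (4, 5, 6, 7))
assert abs(s1 - s2) < 1e-9
```

The relative-position invariance checked at the end is what makes such an encoding attractive for jointly generating a composite design and a variable number of layers: tokens from different layers can share one attention space while remaining distinguishable along the layer axis.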