OmniGen-AR: AutoRegressive Any-to-Image Generation

2026-06-08
Computer Vision and Pattern Recognition

AI summary

The authors developed OmniGen-AR, a single model that can create images from many different types of input, such as text, segmentation maps, depth data, or reference images and video frames. A shared visual tokenizer and a text tokenizer turn these inputs into discrete tokens, so one autoregressive model can process them together. To keep information from leaking from the condition tokens into the image content tokens during training, the authors introduce Disentangled Causal Attention, which is used only at training time and leaves inference unchanged. The model reports state-of-the-art or competitive results on benchmarks such as GenEval and VBench. Overall, the work shows that a single model can generate images effectively from many kinds of conditioning input.

Autoregressive models, Image generation, Multi-modality, Visual tokenizer, Text tokenizer, Disentangled Causal Attention, Next-token prediction, Conditional inputs, Segmentation-to-image, Depth-to-image
Authors
Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang
Abstract
Autoregressive (AR) models have demonstrated strong potential in visual generation, offering superior performance with simple architectures and optimization objectives. However, existing methods are typically limited to single-modality conditions, e.g., text, restricting their applicability in real-world scenarios that demand image synthesis from diverse controls. In this work, we present OmniGen-AR, a unified autoregressive framework for Any-to-Image generation. By discretizing various visual conditions through a shared visual tokenizer and text prompts with a text tokenizer, OmniGen-AR supports a broad spectrum of conditional inputs within a single model, including text (text-to-image generation), spatial signals (segmentation-to-image and depth-to-image), and visual context (image editing, frame prediction, and text-to-video generation). To mitigate the risk of information leakage from condition tokens to content tokens, we introduce Disentangled Causal Attention (DCA), which separates the full-sequence causal mask into condition causal attention and content causal attention. It serves as a training-time regularizer without affecting the standard next-token prediction during inference. With this design, OmniGen-AR achieves new state-of-the-art or at least competitive results across a range of benchmarks, e.g., 0.63 on GenEval and 80.02 on VBench, demonstrating its effectiveness in flexible and high-fidelity visual generation.
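
The mask separation described in the abstract can be pictured with a small sketch. It assumes a sequence laid out as condition tokens followed by content tokens, and it only shows how a standard causal mask decomposes into a condition-causal block, a content-causal block, and the remaining cross block (content queries looking at condition keys). The abstract does not say how DCA treats that cross block or how the two attentions are combined during training, so the function names (`full_causal_mask`, `split_causal_mask`, `union`, `render`) and the layout are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

# mask[i][j] is True when query position i may attend to key position j.
Mask = List[List[bool]]


def full_causal_mask(n: int) -> Mask:
    """Standard causal mask over a sequence of length n."""
    return [[j <= i for j in range(n)] for i in range(n)]


def split_causal_mask(n_cond: int, n_content: int) -> Tuple[Mask, Mask, Mask]:
    """Decompose the causal mask of a [condition | content] sequence.

    Assumption (not from the paper): condition tokens occupy positions
    0..n_cond-1 and content tokens occupy the remaining positions.
    """
    n = n_cond + n_content
    # Condition queries attending causally to condition keys.
    cond = [[i < n_cond and j <= i for j in range(n)] for i in range(n)]
    # Content queries attending causally to content keys.
    content = [[n_cond <= j <= i for j in range(n)] for i in range(n)]
    # Content queries attending to condition keys (the cross block).
    cross = [[i >= n_cond and j < n_cond for j in range(n)] for i in range(n)]
    return cond, content, cross


def union(*masks: Mask) -> Mask:
    """Element-wise OR of several masks of the same size."""
    n = len(masks[0])
    return [[any(m[i][j] for m in masks) for j in range(n)] for i in range(n)]


def render(mask: Mask) -> str:
    """Draw a mask with 'x' for allowed and '.' for blocked positions."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    cond, content, cross = split_causal_mask(n_cond=2, n_content=3)
    print("condition-causal + content-causal blocks:")
    print(render(union(cond, content)))
    print("full causal mask (all three blocks):")
    print(render(union(cond, content, cross)))
```

With two condition tokens and three content tokens, the union of all three blocks reproduces the ordinary lower-triangular mask, which is consistent with the abstract's statement that standard next-token prediction is unchanged at inference.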