ERNIE-Image Technical Report

2026-05-25 • Computer Vision and Pattern Recognition

Computer Vision and Pattern RecognitionMachine Learning

AI summaryⓘ

The authors present ERNIE-Image, an open-source model that creates images from text descriptions using a special 8 billion parameter architecture. They improve training by carefully selecting and organizing large amounts of image and caption data to reduce errors while keeping detailed and rare concepts. After initial training, they fine-tune the model to better follow user instructions and make images look nicer based on human preferences. They also developed tools to help users write better prompts and created a benchmark to evaluate image aesthetics. Their tests show ERNIE-Image performs very well compared to other open models and nearly matches commercial systems.

text-to-image generationDiT architecturepre-trainingdata construction pipelineinstruction followingdistillationprompt engineeringaesthetic evaluationopen-source AIfine-tuning

Authors

Jiaxiang Liu, Zhida Feng, Pengyu Zou, Zhenyu Qian, Tianrui Zhu, Jun Xia, Yuehu Dong, Yanzheng Lin, Honglin Xiong, Anqi Chen, Yunpeng Ding, Jinghui Duan, Lin Gao, Chao Han, Tiechao He, Jiakang Hu, Ranjun Hua, Xueming Jiang, Qingli Kong, Yuting Lei, Tianyu Li, Yunlin Liu, Changling Liu, Yaxin Liu, Yi Liu, Xuguang Liu, Xiaolong Ma, Yan Pan, Yiran Ren, Nan Sheng, Yu Sun, Siyang Sun, Yixiang Tu, Yang Wan, Huanai Wang, Siqi Wang, Yang Wu, Youzhi Yang, Xiaowen Yang, Jianwen Yang, Yehua Yang, Quanwen Zhang, Xinmin Zhang, Haoxin Zhang, Xiang Zhang, Jun Zhang, Qian Zhang, Qiao Zhao, Qi Zhou

Abstract

We introduce ERNIE-Image, an open-source text-to-image generation model built upon an 8B single-stream DiT architecture. ERNIE-Image aims to bridge the gap between current open-source models and leading closed-source systems through more effective mining of large-scale pre-training data and improved supervision quality throughout training. During pre-training, we adopt a bottom-up data construction pipeline that combines fine-grained image categorization, rich caption annotation, aesthetic assessment, and hierarchical sampling. This strategy reduces data noise while preserving long-tail concepts and detailed real-world knowledge, providing a stronger foundation for complex generation tasks. In the post-training stage, we use a top-down data construction pipeline for high-demand scenarios, diversify prompt annotations to better match real user inputs, and apply a stabilized DPO strategy to align the model with human aesthetic preferences. We further train ERNIE-Image-Turbo for efficient 8-NFE generation and propose MT-DMD to mitigate capability drift during distillation. To make the model easier to use in practical scenarios, we equip it with a lightweight Prompt Enhancer that expands concise user intents into structured visual descriptions. In addition, we develop ERNIE-Image-Aes, an industrial-grade aesthetic model, together with ERNIE-Image-Aes-1K, a human-annotated benchmark for realistic aesthetic evaluation. Extensive qualitative and quantitative experiments show that ERNIE-Image achieves leading performance among open-source models and approaches top-tier commercial models in instruction following, text rendering, and aesthetic quality. We release the trained models and aesthetic resources to facilitate further academic research and technical progress in the AIGC community.

View PDFOpen arXiv