PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models

2026-03-30

Computer Vision and Pattern Recognition
AI summary

The authors present PoseDreamer, a new method for creating large sets of photorealistic human images with detailed 3D mesh labels using AI-driven image generation. Instead of relying on real photos with hard-to-obtain 3D labels or computer-rendered images that look less realistic, their system uses diffusion models and careful filtering to produce realistic and diverse images correctly linked to 3D data. They generated over 500,000 high-quality images that train models as well as or better than datasets built from real photos or traditional synthetic rendering. Their work also shows that combining PoseDreamer data with other synthetic datasets improves model performance further. They plan to release the generated dataset and the code behind it.

3D human mesh estimation, depth ambiguity, monocular images, diffusion models, synthetic datasets, image generation, Direct Preference Optimization, curriculum learning, quality filtering, computer vision datasets
Authors
Lorenza Prospero, Orest Kupyn, Ostap Viniavskyi, João F. Henriques, Christian Rupprecht
Abstract
Acquiring labeled datasets for 3D human mesh estimation is challenging due to depth ambiguities and the inherent difficulty of annotating 3D geometry from monocular images. Existing datasets are either real, with manually annotated 3D geometry and limited scale, or synthetic, rendered from 3D engines that provide precise labels but suffer from limited photorealism, low diversity, and high production costs. In this work, we explore a third path: generated data. We introduce PoseDreamer, a novel pipeline that leverages diffusion models to generate large-scale synthetic datasets with 3D mesh annotations. Our approach combines controllable image generation with Direct Preference Optimization for control alignment, curriculum-based hard sample mining, and multi-stage quality filtering. Together, these components naturally maintain correspondence between 3D labels and generated images, while prioritizing challenging samples to maximize dataset utility. Using PoseDreamer, we generate more than 500,000 high-quality synthetic samples, achieving a 76% improvement in image-quality metrics compared to rendering-based datasets. Models trained on PoseDreamer achieve performance comparable to or superior to those trained on real-world and traditional synthetic datasets. In addition, combining PoseDreamer with synthetic datasets results in better performance than combining real-world and synthetic datasets, demonstrating the complementary nature of our dataset. We will release the full dataset and generation code.
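The abstract describes a generation loop with three interacting parts: pose-conditioned image generation, multi-stage quality filtering, and curriculum-based hard-sample mining. The sketch below is a minimal illustration of how such a loop could be structured; every function name and threshold is a hypothetical placeholder, not the authors' actual implementation or API.

```python
import random

def generate_sample(pose, seed):
    """Placeholder for a pose-conditioned diffusion model call.
    Returns mock quality/alignment scores for illustration only."""
    rng = random.Random(seed)
    return {"pose": pose, "seed": seed,
            "quality": rng.random(),      # stand-in for an image-quality metric
            "pose_error": rng.random()}   # stand-in for image/label disagreement

def passes_filters(sample, quality_thresh=0.5, pose_thresh=0.8):
    """Multi-stage filtering sketch: stage 1 rejects low image quality,
    stage 2 rejects samples whose image drifts from the conditioning pose,
    which is what keeps 3D labels and generated images in correspondence."""
    if sample["quality"] < quality_thresh:   # stage 1: image quality
        return False
    if sample["pose_error"] > pose_thresh:   # stage 2: label/image agreement
        return False
    return True

def build_dataset(poses, rounds=3):
    """Curriculum-style mining: poses whose samples fail filtering are
    re-generated in later rounds, prioritizing challenging cases."""
    dataset, hard_poses = [], list(poses)
    for r in range(rounds):
        kept, failed = [], []
        for i, pose in enumerate(hard_poses):
            s = generate_sample(pose, seed=r * 10_000 + i)
            (kept if passes_filters(s) else failed).append(s)
        dataset.extend(kept)
        hard_poses = [s["pose"] for s in failed]  # retry only the hard ones
    return dataset

data = build_dataset(poses=[f"pose_{i}" for i in range(100)])
print(f"kept {len(data)} samples")
```

In a real pipeline the mock scores would come from learned quality and pose-consistency models, and the generator would be a diffusion model fine-tuned (e.g. with Direct Preference Optimization, as the paper describes) to follow the pose conditioning more faithfully.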