PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation

2026-07-02 • Computer Vision and Pattern Recognition

AI summary

The authors show that turning a single image into 3D geometry does not require complex hybrid architectures or intricate loss functions. They build a minimalist Diffusion Transformer on a plain ViT that works directly on patches of raw 3D point maps and is conditioned on image tokens from a pre-trained DINOv3 model. The diffusion backbone itself is trained from scratch, so no point map tokenizer is needed. Despite its simplicity, the method outperforms more complex latent diffusion models and produces sharper geometry, especially in ambiguous regions such as transparent objects.

3D reconstruction, Diffusion Transformer, ViT (Vision Transformer), DINOv3, Latent diffusion, Point map, Single-image reconstruction, Geometric structure, Transparent objects

Authors

Haofei Xu, Rundi Wu, Philipp Henzler, Nikolai Kalischek, Michael Oechsle, Fabian Manhardt, Marc Pollefeys, Andreas Geiger, Federico Tombari, Michael Niemeyer

Abstract

State-of-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures and loss functions, or compress geometry into latent spaces in order to leverage pre-trained latent diffusion models. In this work, we show that such architectural overhead and intricate loss formulations are unnecessary. We introduce a minimalist pixel-space Diffusion Transformer, built on a plain ViT, that operates directly on raw 3D point map patches and is conditioned on image tokens from a pre-trained DINOv3. Unlike existing latent diffusion approaches, we train our diffusion backbone entirely from scratch, eliminating the need for point map tokenizers. Despite its simplicity, our approach surpasses complex latent-based diffusion models while remaining significantly simpler than hybrid alternatives. Notably, it produces sharper geometric structure and is more robust in highly ambiguous regions, such as transparent objects.
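To make "operates directly on raw 3D point map patches" concrete, the following is a minimal sketch of how a point map could be split into patch tokens and reassembled without any learned tokenizer. It is an illustration only, not the authors' implementation: the function names `patchify` and `unpatchify`, the patch size of 16, and the 32×48 resolution are assumptions, since the abstract does not state them.

```python
import numpy as np

def patchify(point_map, patch):
    """Split an (H, W, 3) point map into (N, patch*patch*3) tokens.

    Hypothetical helper; patch layout is an assumption for illustration.
    """
    h, w, c = point_map.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be multiples of patch"
    gh, gw = h // patch, w // patch
    x = point_map.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def unpatchify(tokens, h, w, patch):
    """Inverse of patchify: rebuild the (H, W, C) point map from tokens."""
    c = tokens.shape[1] // (patch * patch)
    gh, gw = h // patch, w // patch
    x = tokens.reshape(gh, gw, patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, patch, gw, patch, c)
    return x.reshape(h, w, c)

# Synthetic point map: one XYZ coordinate per pixel (random values, not real data).
rng = np.random.default_rng(0)
point_map = rng.standard_normal((32, 48, 3))

tokens = patchify(point_map, patch=16)
restored = unpatchify(tokens, 32, 48, patch=16)
```

Because the mapping is a pure reshape, it is exactly invertible, which is one way to read the claim that no point map tokenizer is required. In a full model, these tokens would be noised and denoised by the transformer while it attends to the DINOv3 image tokens; that part is omitted here.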
