FLUX3D: High-Fidelity 3D Gaussian Generation with Diffusion-Aligned Sparse Representation

2026-06-23 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition • Artificial Intelligence

AI summary

The authors identify two main problems in current methods that turn images into 3D objects using sparse voxel representations: losing fine visual details and difficulty matching 2D images with 3D voxels. They propose FLUX3D, a new approach that improves how features are chosen and how the image and 3D data are aligned during generation. Their method uses special latent features and a transformer designed to handle sparse 3D structures, which helps keep the details clearer. Tests show that FLUX3D creates better-looking 3D models than previous methods.

Sparse Voxel Representation • 3D Gaussian Splatting • Diffusion Transformers • Latent Features • Cross-modal Alignment • Decoder-only Architecture • Positional Embedding • Multimodal Learning • 3D Reconstruction • Image-to-3D Generation

Authors

Haorui Ji, Weizhe Liu, Hongdong Li, Hengkai Guo

Abstract

Sparse voxel representation has emerged as a scalable foundation for image-to-3D Gaussian Splatting (3DGS) generation, yet current methods struggle to preserve high-frequency visual details of input images due to two structural bottlenecks. First, they adopt discriminative 2D features optimized for semantic abstraction to construct sparse voxel latents, which suppress reconstructive cues and induce a representation bottleneck. Second, in the generation stage, standard diffusion transformers lack effective mechanisms to align dense 2D image tokens with sparse 3D voxel latents, resulting in a cross-modal correspondence bottleneck. To address these issues, we propose FLUX3D, a scalable image-to-3DGS framework that boosts both representation learning and cross-modal alignment during generation. We first revisit 2D feature selection for sparse-voxel-based 3D representation learning, propose Diffusion-Aligned Structured Latents (DA-SLAT), and couple them with a decoder-only architecture to improve 3DGS reconstruction fidelity. We also design a sparse-structure-aware diffusion framework, which integrates the Sparse-structure Multimodal Diffusion Transformer (SMDiT) and Modal-Aware Rotary Positional Embedding (MARoPE) to achieve geometry-agnostic 2D-3D alignment. Extensive benchmark experiments demonstrate that FLUX3D yields substantial improvements in appearance fidelity and significantly outperforms all state-of-the-art (SOTA) methods in generating high-quality 3DGS assets.
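
The abstract names a rotary positional embedding that has to serve both dense 2D image tokens and sparse 3D voxel latents, but it does not describe how MARoPE is built. As background only, the sketch below shows the generic axial rotary embedding (RoPE) that such designs typically start from: each coordinate axis rotates its own chunk of channels, so attention scores depend on relative offsets. The function names `rope_1d` and `axial_rope`, the equal channel split, and the choice to handle the two modalities separately are all assumptions made for illustration; this is not the authors' MARoPE.

```python
import numpy as np


def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x by angles proportional to pos.

    x:   (n, d) features, d even.
    pos: (n,) positions along one axis.
    """
    n, d = x.shape
    assert d % 2 == 0, "channel count must be even"
    freqs = base ** (-np.arange(0, d, 2) / d)               # (d/2,)
    ang = pos[:, None].astype(np.float64) * freqs[None, :]  # (n, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=np.float64)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out


def axial_rope(x, coords, base=10000.0):
    """Apply rope_1d once per coordinate axis, each on its own channel chunk.

    x:      (n, d) features, d divisible by 2 * k.
    coords: (n, k) integer coordinates (k=2 for image tokens, k=3 for voxels).
    """
    n, d = x.shape
    k = coords.shape[1]
    assert d % (2 * k) == 0, "channels must split evenly into even-sized chunks"
    chunk = d // k
    parts = [
        rope_1d(x[:, i * chunk:(i + 1) * chunk], coords[:, i], base)
        for i in range(k)
    ]
    return np.concatenate(parts, axis=1)


# Illustrative inputs (hypothetical sizes, not taken from the paper).
rng = np.random.default_rng(0)
d = 12

# Dense 2D image tokens on a 4x4 grid: every (row, col) cell has a token.
rows, cols = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
img_coords = np.stack([rows.ravel(), cols.ravel()], axis=1)   # (16, 2)
img_tokens = axial_rope(rng.normal(size=(16, d)), img_coords)

# Sparse 3D voxel latents: only the occupied voxels carry a token.
vox_coords = np.array([[0, 1, 2], [5, 5, 5], [7, 0, 3]])      # (3, 3)
vox_tokens = axial_rope(rng.normal(size=(3, d)), vox_coords)
```

In this plain form the image tokens and voxel tokens live in unrelated coordinate systems, which is one way to see why a modality-aware variant is needed for 2D-3D alignment; how FLUX3D resolves this is left to the paper itself.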
