LiTo: Surface Light Field Tokenization
2026-03-11 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition · Artificial Intelligence · Graphics
AI summary
The authors created a new way to represent 3D objects that captures both their shape and how they look from different angles, including shiny and reflective effects. They use pictures that include depth information to learn how light interacts with surfaces. Their method stores this information in a compact 3D space, allowing realistic rendering from various viewpoints. They also train a model to generate these 3D representations from just one image, keeping the object's appearance consistent with the original lighting. Their tests show their method produces better quality and matches the input images more closely than previous approaches.
3D latent representation, view-dependent appearance, surface light field, specular highlights, Fresnel reflections, RGB-depth images, latent flow matching, geometry reconstruction, image-conditioned generation
Authors
Jen-Hao Rick Chang, Xiaoming Zhao, Dorian Chan, Oncel Tuzel
Abstract
We propose a 3D latent representation that jointly models object geometry and view-dependent appearance. Most prior works focus on either reconstructing 3D geometry or predicting view-independent diffuse appearance, and thus struggle to capture realistic view-dependent effects. Our approach leverages the fact that RGB-depth images provide samples of a surface light field. By encoding random subsamples of this surface light field into a compact set of latent vectors, our model learns to represent both geometry and appearance within a unified 3D latent space. This representation reproduces view-dependent effects such as specular highlights and Fresnel reflections under complex lighting. We further train a latent flow matching model on this representation to learn its distribution conditioned on a single input image, enabling the generation of 3D objects with appearances consistent with the lighting and materials in the input. Experiments show that our approach achieves higher visual quality and better input fidelity than existing methods.
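The abstract's key observation is that each pixel of an RGB-depth image is a sample of the surface light field: a 3D surface point, a viewing direction, and the radiance leaving that point toward the camera. The sketch below (not the authors' code; function name, signature, and the pinhole-camera assumption are illustrative) shows how such samples could be extracted and randomly subsampled from a single posed RGB-D image.

```python
# Minimal sketch (assumed, not the paper's implementation): treat an RGB-D
# image as random samples of a surface light field. Each sampled pixel yields
# a surface point (via depth unprojection), a unit viewing direction, and an
# RGB radiance value.
import numpy as np

def slf_samples(rgb, depth, K, cam_pos=None, n=1024, seed=0):
    """Draw n random surface-light-field samples from one RGB-D image.

    rgb     : (H, W, 3) radiance values
    depth   : (H, W) depth along the camera z-axis
    K       : (3, 3) pinhole camera intrinsics
    cam_pos : (3,) camera center in the same frame as the points (default origin)
    Returns (points, view_dirs, radiance) as (n, 3) arrays.
    """
    H, W = depth.shape
    if cam_pos is None:
        cam_pos = np.zeros(3)
    rng = np.random.default_rng(seed)
    idx = rng.choice(H * W, size=n, replace=False)  # random pixel subsample
    v, u = np.unravel_index(idx, (H, W))

    # Unproject: X = depth * K^{-1} [u, v, 1]^T (camera-space points)
    pix = np.stack([u + 0.5, v + 0.5, np.ones(n)], axis=0)  # (3, n)
    rays = np.linalg.inv(K) @ pix                           # (3, n)
    points = (rays * depth[v, u]).T                         # (n, 3)

    # Viewing direction: unit vector from camera center toward the surface
    view_dirs = points - cam_pos
    view_dirs /= np.linalg.norm(view_dirs, axis=1, keepdims=True)

    return points, view_dirs, rgb[v, u]
```

Under this reading, a set of such (point, direction, radiance) triples drawn from many views would be what the encoder compresses into the compact latent vectors described above.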