One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

2026-03-24

Computer Vision and Pattern Recognition
AI summary

The authors present OVIE, a method that generates novel views of a scene from a single image and is trained without pairs of images taken from different viewpoints. They train on 30 million single images from the internet: a monocular depth estimator lifts each image into 3D, and reprojecting it under a sampled camera transformation yields a pseudo-target view. The depth estimator is used only during training; at test time the model takes just the input image. Because changing the viewpoint uncovers regions that have no source pixels (disocclusions), the training losses are restricted to valid regions. OVIE needs no depth estimator or 3D representation at inference and, in a zero-shot setting, outperforms prior methods while being 600x faster than the second-best baseline.

monocular novel-view synthesis, depth estimation, 3D projection, disocclusion, masked training, zero-shot learning, internet images, geometric transformation
Authors
Adrien Ramanana Rahary, Nicolas Dufour, Patrick Perez, David Picard
Abstract
Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue it is not necessary: one view is enough. We present OVIE, trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and project to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.
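
The lift, transform, and project step and the masked loss described in the abstract can be illustrated with a short NumPy sketch. This is not the authors' implementation: it assumes a pinhole camera with intrinsics `K`, nearest-pixel forward splatting with a z-buffer, and a plain L1 loss standing in for the geometric, perceptual, and textural losses. The function names `warp_to_pseudo_target` and `masked_l1` are hypothetical.

```python
import numpy as np

def warp_to_pseudo_target(image, depth, K, R, t):
    """Forward-warp image (H, W, C) into a new camera using depth (H, W).

    Assumptions (not from the paper): K is a 3x3 pinhole intrinsics matrix,
    and (R, t) maps source-camera coordinates to target-camera coordinates.
    Returns the warped image and a boolean mask of target pixels that
    received a source pixel; the rest are disoccluded or out of view.
    """
    H, W = depth.shape
    C = image.shape[-1]

    # Homogeneous pixel grid, shape (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)]).astype(np.float64)

    # Lift: back-project every pixel to a 3D point in the source frame.
    pts_src = (np.linalg.inv(K) @ pix) * depth.ravel()

    # Transform: express the points in the target camera frame.
    pts_tgt = R @ pts_src + np.asarray(t, dtype=np.float64).reshape(3, 1)

    # Project: perspective divide, then round to the nearest target pixel.
    z = pts_tgt[2]
    in_front = z > 1e-6
    proj = K @ pts_tgt
    z_safe = np.where(in_front, z, 1.0)
    u_t = np.rint(proj[0] / z_safe).astype(int)
    v_t = np.rint(proj[1] / z_safe).astype(int)
    keep = in_front & (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)

    # Z-buffer: for each target pixel keep only the nearest source point.
    idx = np.flatnonzero(keep)
    flat = v_t[idx] * W + u_t[idx]
    zbuf = np.full(H * W, np.inf)
    np.minimum.at(zbuf, flat, z[idx])
    front = z[idx] == zbuf[flat]

    target = np.zeros((H * W, C), dtype=image.dtype)
    target[flat[front]] = image.reshape(-1, C)[idx[front]]
    valid = np.isfinite(zbuf).reshape(H, W)
    return target.reshape(H, W, C), valid

def masked_l1(pred, target, valid):
    """Mean absolute error computed over valid pixels only."""
    diff = np.abs(pred.astype(np.float64) - target.astype(np.float64))
    mask = valid[..., None].astype(np.float64)
    denom = max(float(valid.sum()) * pred.shape[-1], 1.0)
    return float((diff * mask).sum() / denom)
```

In a training loop of this kind, `depth` would come from the monocular depth estimator, `(R, t)` would be sampled at random, and the model's prediction would be compared with the warped image only where `valid` is true, so that holes left by disocclusion do not contribute to the loss.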