Effective Multi-sensor Conditioning for Street-view Novel-view Synthesis

2026-06-01

Computer Vision and Pattern Recognition, Graphics
AI summary

The authors developed StreetNVS, a method that renders realistic videos of driving scenes from new viewpoints using data from a vehicle's sensors, including LiDAR and cameras. The system combines accurate but sparse LiDAR depth, dense camera imagery, and camera poses to recreate scenes, even when the viewpoints differ greatly from the original driving path. They trained the model in two stages, gradually reducing LiDAR density so that it copes with sparse input. In their experiments on the Waymo Open Dataset, it outperforms previous approaches under sparse LiDAR input, matches methods that rely on far denser point clouds, and produces coherent videos from unusual viewing angles.

LiDAR, multi-camera rigs, video diffusion models, sensor fusion, ego-motion, novel view synthesis, positional encoding, Waymo Open Dataset, curriculum training, reference imagery
Authors
Zhengfei Kuang, Adam Sun, Liyuan Zhu, Tong Wu, Shengqu Cai, Jonathan Tremblay, Iro Armeni, Ehsan Adeli, Lior Yariv, Gordon Wetzstein
Abstract
Modern vehicle platforms are equipped with a rich sensor suite, including LiDAR, calibrated multi-camera rigs, and accurate ego-motion, that in principle offers a strong signal for re-rendering a driving scene from novel viewpoints. A growing line of recent work leverages video diffusion models for this task, using their generative priors to synthesize plausible novel views from sparse vehicle observations. In practice, however, existing methods exploit only a fragment of this signal, and their quality tends to degrade as the target trajectory departs from the recorded driving path. We argue that this is fundamentally a multi-sensor fusion problem: sparse LiDAR reprojections supply accurate but incomplete metric geometry, surround-view reference imagery supplies dense appearance but no metric depth, and camera poses tie the two together across views. We introduce StreetNVS, a video diffusion framework that jointly conditions on all three signals through a Reference-Enhanced Camera Attention module based on a relative ray-level positional encoding. We develop a two-stage curriculum training strategy that gradually exposes the model to increasingly sparse LiDAR. On the Waymo Open Dataset, StreetNVS substantially outperforms state-of-the-art baselines under sparse LiDAR conditioning and matches methods that rely on 10 to 100 times denser point clouds. We further demonstrate that it can synthesize coherent videos along extreme out-of-trajectory paths such as elevation, lane-shift, pullback, and rotation. Project website: https://streetnvs.github.io
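
To illustrate what a sparse LiDAR reprojection looks like in practice, the following is a minimal sketch of projecting a LiDAR point cloud into a target camera to obtain a sparse depth map. It is not the authors' implementation: the function name `project_lidar_to_depth`, its arguments, and the conventions used (an undistorted pinhole camera looking along +z, a 4x4 world-to-camera matrix, and 0 marking pixels without a LiDAR return) are all assumptions made for this example.

```python
import numpy as np

def project_lidar_to_depth(points_world, world_to_cam, K, height, width):
    """Project 3D points into a pinhole camera and return a sparse depth map.

    Hypothetical helper for illustration only.
    points_world: (N, 3) points in world coordinates.
    world_to_cam: (4, 4) rigid transform from world to camera frame.
    K:            (3, 3) pinhole intrinsics (no lens distortion assumed).
    Returns an (height, width) array; pixels with no LiDAR return are 0.
    """
    n = points_world.shape[0]
    homog = np.hstack([points_world, np.ones((n, 1))])

    # Move points into the camera frame and drop those behind the camera.
    cam = (world_to_cam @ homog.T).T[:, :3]
    cam = cam[cam[:, 2] > 1e-6]

    # Perspective projection to pixel coordinates.
    uvw = (K @ cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = cam[:, 2]

    # Keep only points that land inside the image.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    # When several points hit the same pixel, keep the nearest one.
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0
    return depth
```

Because a LiDAR sweep covers only a small fraction of image pixels, most entries of the returned map stay empty. This is the sense in which such geometry is accurate but incomplete, and it becomes emptier still when the target camera moves away from the recorded trajectory.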