CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
2026-04-21 • Computer Vision and Pattern Recognition
AI summary
The authors created CityRAG, a tool that can generate long, realistic videos of actual places that you can navigate through, like a virtual tour. Unlike other models, it can handle changes like different weather or moving objects by learning from lots of real-world location data. CityRAG can keep the scene looking consistent over time, such as maintaining the same lighting and weather for thousands of frames, and can even loop back to the starting point accurately. This makes it useful for things like self-driving car simulations or robot training in the real world.
3D-consistent video generation, Geo-registered data, Scene grounding, Temporal disentanglement, Weather simulation, Loop closure, Autonomous driving simulation, Robotics simulation, Video generative models, Spatial navigation
Authors
Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng, Songyou Peng, Bharath Hariharan, Jason Y. Zhang, Noah Snavely, Philipp Henzler
Abstract
We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, they lack the ability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations, a capability essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation in the physical scene, while retaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent, minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.