Grounding World Simulation Models in a Real-World Metropolis

2026-03-16 · Computer Vision and Pattern Recognition

AI summary

The authors present Seoul World Model (SWM), a generative model that produces videos of a real city, Seoul, by using actual street-view images to guide video generation. Unlike earlier world models that invent fictional cities, SWM retrieves photos of nearby locations so that generated videos look realistic and match the real place. To cope with timing mismatches between reference photos and the scene, limited camera trajectories, and sparse image coverage, they pair images taken at different times, add synthetic data for trajectory variety, and interpolate between sparse views to build smooth training videos. They also introduce a method that keeps long videos stable by continually re-grounding generation on an image retrieved at a future location. Tested in Seoul, Busan, and Ann Arbor, SWM generates videos that are more spatially accurate and more consistent over long distances than previous methods.
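To make the grounding step concrete, here is a minimal sketch of geolocation-based retrieval, the step that selects which real street-view image conditions the next generated chunk. The record layout, the `nearest_street_view` helper, and the equirectangular distance are illustrative assumptions; the paper's actual retrieval pipeline is not specified here.

```python
import math

def nearest_street_view(position, panoramas):
    """Return the panorama captured closest to `position`.

    position: (lat, lon) in degrees.
    panoramas: list of dicts with "lat", "lon", and an image payload
    (a hypothetical layout, not SWM's actual data format).
    """
    lat0, lon0 = position

    def planar_dist(rec):
        # Equirectangular approximation: longitude differences are scaled
        # by cos(latitude). Accurate enough at city scale.
        dx = (rec["lon"] - lon0) * math.cos(math.radians(lat0))
        dy = rec["lat"] - lat0
        return math.hypot(dx, dy)

    return min(panoramas, key=planar_dist)

# Example: pick the grounding reference for the current camera position.
panoramas = [
    {"lat": 37.5665, "lon": 126.9780, "image": "pano_001.jpg"},
    {"lat": 37.5512, "lon": 126.9882, "image": "pano_002.jpg"},
]
reference = nearest_street_view((37.5650, 126.9770), panoramas)
print(reference["image"])  # -> pano_001.jpg
```

In SWM, a reference retrieved this way conditions the autoregressive video generator alongside the camera trajectory, anchoring the generated content to the real scene.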

Keywords
world model, autoregressive video generation, street-view images, retrieval-augmented conditioning, temporal misalignment, trajectory diversity, synthetic dataset, view interpolation, Virtual Lookahead Sink, urban environment modeling
Authors
Junyoung Seo, Hyunwook Choi, Minkyung Kwon, Jinhyeok Choi, Siyoon Jin, Gayoung Lee, Junho Kim, JoungBin Lee, Geonmo Gu, Dongyoon Han, Sangdoo Yun, Seungryong Kim, Jin-Hwa Kim
Abstract
What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity, and data sparsity arising from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.
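As a rough illustration of the Virtual Lookahead Sink described above, the sketch below rolls out video chunk by chunk and re-grounds each chunk on an image retrieved at a pose ahead of the current one. `model.generate_chunk` and `retrieve_near` are hypothetical interfaces standing in for SWM's generator and retrieval components; the chunking and lookahead parameters are likewise illustrative.

```python
def generate_with_lookahead_sink(model, trajectory, retrieve_near,
                                 chunk_len=16, lookahead_chunks=2):
    """Chunked autoregressive rollout, re-grounded on a future reference.

    trajectory:    camera poses along the route, in order.
    retrieve_near: pose -> nearest real street-view image (e.g. the
                   nearest_street_view helper sketched earlier).
    """
    frames = []
    for start in range(0, len(trajectory), chunk_len):
        poses = trajectory[start:start + chunk_len]
        # Anchor on a pose a couple of chunks ahead (clamped at the route
        # end), so drift accumulated inside the chunk is pulled back toward
        # the real scene the camera is moving into.
        future = min(start + lookahead_chunks * chunk_len, len(trajectory) - 1)
        sink = retrieve_near(trajectory[future])
        # Hypothetical generator call: conditions on recently generated
        # frames, the target poses, and the retrieved lookahead reference.
        frames += model.generate_chunk(context=frames[-chunk_len:],
                                       poses=poses, reference=sink)
    return frames
```

The key design point, per the abstract, is that each chunk is conditioned on a reference at a *future* location rather than only on past frames, which counteracts drift over trajectories of hundreds of meters.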