Walking in the Implicit: Interactive World Exploration via Neural Scene Representation

2026-06-29
Computer Vision and Pattern Recognition

AI summary

The authors propose a new approach to interactive video generation: instead of rolling out an ever-growing sequence of latent video frames, they represent the scene as a compact, fixed-length implicit state called a Neural Implicit Scene (NIS). Their system, NeuWorld, evolves this state along future camera trajectories and renders it from the requested camera poses, which separates scene dynamics from image synthesis. The model is trained from scratch on public posed-view data (images with known camera poses), without pretrained video backbones or auxiliary 3D reconstructors. The authors report strong consistency over long sequences and favorable inference efficiency.

interactive video generation, latent space, Neural Implicit Scene (NIS), transformer VAE, diffusion model, camera trajectory, pose-conditioned rendering, 3D scene representation, video synthesis, stochastic transition
Authors
Zhiqi Li, Chengrui Dong, Zhenhua Du, Hangning Zhou, Cong Qiu, Hailong Qin, Mu Yang, Dongxu Wei, Peidong Liu
Abstract
Interactive video generation systems for camera-controlled world exploration roll out growing sequences of latent video frames, entangling state transition with high-frequency observation synthesis. We propose Walking in the Implicit, a scene-centric paradigm that changes the rollout variable from frame latents to a fixed-length, renderable implicit state, termed Neural Implicit Scene (NIS). This factorizes interactive generation into stochastic transition of a compact scene state and deterministic pose-conditioned rendering given the sampled state. We instantiate this paradigm as NeuWorld: a transformer VAE learns locally anchored NIS from sparse posed frames, and a diffusion transformer evolves NIS conditioned on future camera trajectories and geometry-aware retrieved history. By reusing the VAE encoder as a unified conditioner, NeuWorld maps camera, reference-image, and history cues into the same NIS modality, avoiding external heterogeneous encoders. Trained from scratch on public posed-view data without pretrained video backbones or auxiliary 3D reconstructors, NeuWorld achieves strong long-horizon consistency with favorable inference efficiency.
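
The factorization described in the abstract (a stochastic transition of a fixed-length scene state, followed by deterministic pose-conditioned rendering) can be illustrated with a toy sketch. This is not NeuWorld's architecture: the state shape (`NUM_TOKENS`, `TOKEN_DIM`), the `(x, y, yaw)` pose format, and the functions `init_state`, `transition`, `render`, and `rollout` are hypothetical stand-ins. A simple noisy update takes the place of the diffusion transformer, and a closed-form function takes the place of the learned renderer. The sketch only shows the control flow: the rollout variable stays the same size at every step, and all randomness lives in the transition.

```python
import math
import random
from typing import List, Sequence, Tuple

Pose = Tuple[float, float, float]   # hypothetical pose format: (x, y, yaw)
State = List[List[float]]           # fixed-length scene state: NUM_TOKENS x TOKEN_DIM

NUM_TOKENS = 8   # hypothetical state length, chosen for illustration only
TOKEN_DIM = 4    # hypothetical token width, chosen for illustration only


def init_state(rng: random.Random) -> State:
    """Draw an initial scene state (stand-in for encoding sparse posed frames)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(TOKEN_DIM)] for _ in range(NUM_TOKENS)]


def transition(state: State, trajectory: Sequence[Pose], rng: random.Random,
               noise_scale: float = 0.1) -> State:
    """Stochastic transition: sample the next state given the future trajectory.

    Toy stand-in for a learned generative transition; output shape equals input shape.
    """
    drift = sum(x + y for x, y, _ in trajectory) / len(trajectory)
    return [[0.9 * v + 0.1 * math.tanh(drift) + noise_scale * rng.gauss(0.0, 1.0)
             for v in token] for token in state]


def render(state: State, pose: Pose) -> List[float]:
    """Deterministic rendering: the same state and pose always give the same output."""
    x, y, yaw = pose
    return [math.tanh(t[0] * x + t[1] * y + t[2] * math.cos(yaw) + t[3] * math.sin(yaw))
            for t in state]


def rollout(num_steps: int, poses_per_step: int, seed: int = 0):
    """Roll out the scene state, rendering every pose of each trajectory chunk."""
    rng = random.Random(seed)
    state = init_state(rng)
    frames: List[List[float]] = []
    state_sizes: List[int] = []
    for step in range(num_steps):
        # Hypothetical camera path: move along x while slowly turning.
        trajectory = [(0.1 * (step * poses_per_step + k), 0.0, 0.05 * k)
                      for k in range(poses_per_step)]
        state = transition(state, trajectory, rng)        # randomness enters here only
        state_sizes.append(len(state) * len(state[0]))    # constant across steps
        frames.extend(render(state, pose) for pose in trajectory)
    return frames, state_sizes


if __name__ == "__main__":
    frames, sizes = rollout(num_steps=3, poses_per_step=4)
    print(len(frames), sizes)
```

In this sketch the number of rendered frames grows with the rollout length while the state carried between steps does not, which is the contrast the abstract draws with rolling out frame latents.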