Unified Driving Tokens: Representation- and Geometry-Guided Discrete Tokenizer for Driving World Models and Planning
2026-06-01 • Computer Vision and Pattern Recognition
AI summary
The authors propose a way to turn driving scenes into discrete visual tokens that better support a self-driving system's scene understanding and planning. Unlike earlier tokenizers, which are tuned mainly to reconstruct images faithfully, their approach also encodes information about scene geometry and motion. Training combines several supervision signals, including depth from adjacent frames and the relative pose between them, to make the tokens more meaningful. In experiments on NAVSIM, the tokens improved reconstruction fidelity and representation consistency, gave competitive planning performance, and improved generative quality.
discrete visual tokens, tokenizer, DINO feature space, RGB reconstruction, perceptual loss, adversarial loss, depth supervision, relative pose, multi-codebook quantization, world modeling
Authors
Ziyang Yao, Zeyu Zhu, YunCheng Jiang, Zibin Guo, Huijing Zhao
Abstract
Discrete visual tokens should provide a compact representation for both token-based world modeling and planning in autonomous driving. However, most tokenizers are inherited from image generation and are optimized mainly for pixel reconstruction, which may leave a gap between what is easy to generate and what is useful to decode for driving decisions. We present a representation-guided and geometry-enhanced tokenizer that learns discrete tokens under joint supervision. The tokenizer aligns its discrete bottleneck with a frozen DINO feature space through feature decoding, while preserving appearance via RGB reconstruction with perceptual and adversarial losses. To inject geometric, state-related cues, we add adjacent-frame depth and relative-pose supervision during training and stabilize the joint objectives with multi-codebook quantization. We evaluate the same learned tokens with a lightweight planning readout and a GPT-style next-token world model. Experiments on NAVSIM show improved reconstruction fidelity and representation consistency, competitive planning performance under a fixed decoder, and better generative quality under matched settings.
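The abstract does not say how multi-codebook quantization is implemented. As a rough illustration of the general idea only, the sketch below assumes one common variant: each latent vector is split into equal channel groups, and each group is matched to its nearest entry in its own codebook, so one latent yields several token indices. The function name `multi_codebook_quantize`, the group count, the codebook size, and the dimensions are all hypothetical and not taken from the paper.

```python
import numpy as np

def multi_codebook_quantize(z, codebooks):
    """Quantize latents with one codebook per channel group (illustrative only).

    z:         (N, D) array of continuous latents.
    codebooks: list of G arrays, each of shape (K, D // G).
    Returns token indices of shape (N, G) and quantized latents of shape (N, D).
    """
    # Assumption: D is divisible by the number of codebooks.
    groups = np.split(z, len(codebooks), axis=1)
    indices, quantized = [], []
    for chunk, codebook in zip(groups, codebooks):
        # Squared Euclidean distance from every chunk to every code: (N, K).
        dists = ((chunk[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)       # nearest code per latent
        indices.append(idx)
        quantized.append(codebook[idx])  # look up the selected code vectors
    return np.stack(indices, axis=1), np.concatenate(quantized, axis=1)

# Hypothetical sizes: 2 codebooks of 16 codes each, 8-dimensional latents.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 4)) for _ in range(2)]
z = rng.normal(size=(5, 8))
tokens, z_q = multi_codebook_quantize(z, codebooks)
```

In this sketch each latent is represented by two indices rather than one, which enlarges the effective vocabulary without enlarging any single codebook. A trained tokenizer would also need a gradient path through the `argmin` (for example a straight-through estimator) and codebook update rules, both of which are omitted here.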