Network-Efficient World Model Token Streaming
2026-05-11 • Robotics
AI summary
The authors study how to efficiently stream compact discrete representations of driving scenes across connected vehicles with limited network bandwidth. They propose an adaptive method that sends keyframes and smaller updates (deltas) based on changes measured in the token embedding space, improving reconstruction fidelity compared to fixed-interval keyframes. This approach also maintains quality better under packet loss and improves the performance of a model predicting future tokens from the streamed data. Their work shows that smart streaming of discrete tokens can effectively synchronize driving world-model state over constrained vehicle networks.
Keywords
generative driving world models, latent state representation, VQ-U-Net tokenizer, codebook embedding space, keyframe-delta protocol, cosine distance, rate distortion, packet loss, next-token prediction, perplexity
Authors
Shatadal Mishra, Ahmadreza Moradipari, Nejib Ammar
Abstract
Generative driving world models rely on compact latent state representations that must be efficiently transmitted and synchronized across distributed compute and connected vehicles. We study network-efficient streaming of a discrete world model state, where a stride-16 VQ-U-Net tokenizer (codebook size 8,192) maps each 288×512 frame to an 18×32 grid of token IDs (576 tokens/frame), equivalent to 936 bytes/frame under fixed-length coding. We consider a keyframe-delta protocol under strict per-message payload budgets and packet loss, and propose a fully online, label-free algorithm that prioritizes delta updates via cosine distance in codebook embedding space and triggers keyframes adaptively using a Hamming-drift threshold. The adaptive algorithm consistently improves the rate-distortion frontier over periodic keyframes at matched bitrates: at 0.024 Mb/s (200-byte budget), dynamic-only embedding distortion drops from 0.0712 to 0.0661 (7.2%), and at 0.036 Mb/s (400-byte budget) from 0.0427 to 0.0407 (4.8%). Under 10% delta packet loss at 200 bytes, dynamic-only distortion is 0.0757 versus 0.0789 for a matched periodic baseline. To connect state fidelity to world model usefulness, we train a lightweight next-token predictor and evaluate perplexity conditioned on streamed receiver states: at 0.024 Mb/s, dynamic-position perplexity improves from 206.0 to 193.1 (6.3%), and at 0.036 Mb/s from 158.9 to 155.6 (2.1%). These results support discrete token-state streaming as a practical systems layer for bandwidth-aware synchronization and improved downstream token-dynamics utility under vehicular networking constraints.
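The abstract's protocol — cosine-distance delta prioritization in codebook embedding space plus a Hamming-drift keyframe trigger — can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the embedding dimension (64), drift threshold (0.25), random codebook, and 3-byte-per-delta encoding are hypothetical; only the grid size (18×32 = 576 tokens), codebook size (8,192), and 936 bytes/frame figure come from the abstract.

```python
import numpy as np

# Figures from the abstract; everything below them is illustrative.
GRID = 18 * 32                 # 576 tokens per frame
CODEBOOK_SIZE = 8192
BITS_PER_TOKEN = int(np.ceil(np.log2(CODEBOOK_SIZE)))   # 13 bits
assert GRID * BITS_PER_TOKEN // 8 == 936                # 936 bytes/frame, fixed-length coding

rng = np.random.default_rng(0)
# Hypothetical 64-dim codebook embeddings, unit-normalized so dot product = cosine similarity.
codebook = rng.normal(size=(CODEBOOK_SIZE, 64))
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def cosine_distance(a_ids, b_ids):
    """Per-position cosine distance between two token-ID grids in embedding space."""
    return 1.0 - np.sum(codebook[a_ids] * codebook[b_ids], axis=-1)

def stream_step(sender_ids, receiver_ids, last_key_ids,
                budget_bytes=200, drift_threshold=0.25):
    """One step of a keyframe-delta protocol (sketch):
    - trigger a keyframe when Hamming drift since the last keyframe exceeds a threshold;
    - otherwise send the highest cosine-distance positions as (index, token) deltas
      under the per-message payload budget."""
    drift = np.mean(sender_ids != last_key_ids)          # Hamming drift fraction
    if drift > drift_threshold:
        return ("keyframe", sender_ids.copy())
    # One delta entry: 10-bit position index + 13-bit token ID ≈ 3 bytes (illustrative).
    n_deltas = budget_bytes // 3
    dist = cosine_distance(sender_ids, receiver_ids)
    worst_first = np.argsort(dist)[::-1][:n_deltas]      # largest embedding mismatch first
    return ("delta", [(int(i), int(sender_ids[i])) for i in worst_first])
```

On receipt, the receiver either replaces its whole token grid (keyframe) or patches only the listed positions (delta); prioritizing by embedding distance rather than raw ID mismatch spends the byte budget on the perceptually largest discrepancies.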