Planning-aligned Token Compression for Long-Context Autonomous Driving
2026-06-05 • Robotics
Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition
AI summary
The authors address a challenge in autonomous driving models that process long sequences of information: the resulting token sequences can exceed real-time computational budgets and slow down decision-making. They introduce COMPACT-VA, a method that compresses past context into a bounded representation while keeping the details needed for safe driving decisions. The compression is built on a conditional VQ-VAE and is conditioned on the vehicle's past trajectory and a learned planning intent, so that decision-critical context is preserved. Under comparable token budgets, the method improves success rates by more than 6% (reaching 68.3%), and closed-loop evaluation reports a 3.3× speedup and 2.7× memory reduction over uncompressed processing while maintaining general driving performance.
Autonomous driving, Vision-action models, Token compression, VQ-VAE, Planning intent, Trajectory prediction, End-to-end optimization, Temporal context, Closed-loop evaluation
Authors
Zhixuan Liang, Yuxiao Chen, Yurong You, Peter Karkus, Wenhao Ding, Boyi Li, Alexander Popov, Yan Wang, Maximilian Igl, Yiming Li, Danfei Xu, Nikolai Smolyanskiy, Boris Ivanovic, Ping Luo, Marco Pavone
Abstract
Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding extended temporal context for complex interactions. While approaches such as linear transformers and external memory try to make the context lightweight, token compression is the most compatible with the architecture because it requires no backbone modifications. Yet existing compression methods adopt rule-based heuristics such as temporal decay that are decoupled from planning, risking the loss of decision-critical information. We propose COMPACT-VA, a planning-aligned working memory framework built on a conditional VQ-VAE that compresses extended context into bounded representations. Compression is conditioned on both the historical trajectory and a learned planning intent: during training, the posterior encoder distills this intent from future trajectories, while the prior encoder learns to predict it from compressed observations. The compressed memory, concatenated with the predicted latent, feeds the policy for end-to-end optimization, so that planning proceeds with decision-critical information retained. We evaluate on high-signal dynamic scenarios where historical context is most critical for behavior correctness (e.g., stop, yield, or proceed), and design behavioral metrics accordingly. Under comparable token budgets, we achieve a $>$6% improvement in success rate (68.3%), with consistent gains across metrics. Ablations validate the effectiveness of planning-aligned coupling. Closed-loop evaluation confirms that COMPACT-VA maintains general driving performance with a 3.3× speedup and 2.7× memory reduction over uncompressed processing.
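To make the quantization and concatenation steps mentioned in the abstract more concrete, here is a minimal sketch of the generic VQ-VAE codebook lookup followed by appending the quantized latent to a compressed memory. It is not the authors' implementation: the function names (`quantize`, `build_policy_input`), the toy 2-D vectors, and the three-entry codebook are all assumptions made for illustration, and the learned encoders, the conditioning on trajectories, and the policy itself are omitted.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent: Sequence[float], codebook: Sequence[Sequence[float]]) -> Tuple[int, Vector]:
    """Standard VQ step: snap a continuous latent to its nearest codebook entry."""
    index = min(range(len(codebook)), key=lambda i: squared_distance(latent, codebook[i]))
    return index, list(codebook[index])


def build_policy_input(
    memory_tokens: Sequence[Sequence[float]],
    intent_latent: Sequence[float],
    codebook: Sequence[Sequence[float]],
) -> List[Vector]:
    """Hypothetical helper: append the quantized intent latent to the compressed memory."""
    _, intent_code = quantize(intent_latent, codebook)
    return [list(token) for token in memory_tokens] + [intent_code]


# Toy values (assumptions): a 3-entry codebook and 2 compressed memory tokens.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
memory = [[0.2, 0.1], [0.5, 0.5]]

index, code = quantize([0.9, 0.1], codebook)
policy_input = build_policy_input(memory, [0.9, 0.1], codebook)
```

In this toy setup the policy input has a fixed length (memory tokens plus one latent) regardless of how long the original context was, which is the property that keeps the token budget bounded.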