Learned Image Compression for Vision-Language-Action Models

2026-06-15 • Computer Vision and Pattern Recognition

Computer Vision and Pattern RecognitionArtificial Intelligence

AI summaryⓘ

The authors created SPARC, a new way to compress images for robots that use vision and language to make decisions. They found that some parts of images and certain camera views are more important for the robot's tasks, so SPARC focuses its compression on these key areas to save bandwidth. They also added techniques to help the system learn better without losing rare but important details. Their tests show SPARC helps robots control better than regular image compression, especially when internet bandwidth is limited.

vision-language-action modelsimage compressionbitrate allocationlatent representationsrobotic controltemporal maskingentropy lossmulti-camera systemsreal-time controlvideo codecs

Authors

Hyeonjun Kim, Jegwang Ryu, Sangbeom Ha, Junhyeok Lee, Jun-Hyuk Kim, Hyemin Ahn, Jaeho Lee

Abstract

Vision-language-action (VLA) models increasingly rely on high-frequency multi-camera observations, making visual communication a major bottleneck for real-time robotic control in bandwidth-constrained or distributed deployment settings. Existing image and video codecs, however, are designed to preserve generic visual fidelity rather than the control performance of downstream VLA policies. In this work, we introduce SPARC (SPatially Adaptive Rate Control), a learned image compression framework tailored for VLA-driven robots. Our key observation is that the importance of visual information varies substantially across both camera views and spatial regions within an image. Based on this observation, SPARC employs a lightweight temporal mask selector that adaptively allocates bitrate over latent representations according to task relevance while leveraging temporal context. We further introduce a tilted rate loss that stabilizes training by reducing the tendency of entropy-based objectives to over-suppress rare yet task-critical visual patterns. Experiments on diverse robotic benchmarks, including RoboCasa365, VLABench, and LIBERO, show that SPARC consistently achieves stronger control performance than conventional image/video codecs and recent learned compression methods under the same bitrate budget. We additionally demonstrate real-world deployment benefits in remote-control settings, where our method substantially improves the bitrate-success tradeoff.

View PDFOpen arXiv