C$^3$ache: Accelerating World Action Models with Cross Inference Chunk Cache

2026-06-08, Machine Learning

Machine Learning, Computer Vision and Pattern Recognition, Robotics
AI summary

The authors study how to speed up World Action Models (WAMs), a class of robot decision-making models. WAMs are slow because completing a task takes many inference chunks, and each chunk runs a costly denoising process. The authors find that, during smooth robot movements, the residuals computed at a given denoising step are strongly correlated from one chunk to the next. They introduce C³ache, a training-free method that caches and reuses these residuals across chunks at the same denoising step. With a Fast-WAM backbone, the method runs up to 2.5 times as fast in total wall-clock inference time, with negligible loss in task success rate.

World Action Models, Vision-Language-Action policies, video-modeling, denoising process, inference chunks, caching, residuals, robot behavior, Fast-WAM, task success rate
Authors
Weisen Zhao, Lam Nguyen, Zhicong Lu, Yuzhang Shang
Abstract
World Action Models (WAMs) generalize better than standard Vision-Language-Action (VLA) policies to novel motions and environments, because a video-modeling objective lets them learn from abundant unlabeled video rather than scarce labeled robot demonstrations. This generalization is computationally expensive. To complete a task, a WAM runs over multiple inference chunks, and each chunk requires a costly denoising process. Existing acceleration methods reduce this cost by caching and reusing computation within a single chunk's denoising trajectory. Our empirical analysis reveals a substantial source of redundancy they overlook: redundancy across chunks. When a robot executes a smooth behavior, the residuals computed at a given denoising step are strongly correlated from one chunk to the next. We introduce C$^3$ache, a training-free method that caches and reuses these residuals across inference chunks at the same denoising step. Experiments on benchmarks with a Fast-WAM backbone show that C$^3$ache achieves up to a $2.5\times$ speedup in total wall-clock inference time, with negligible degradation in task success rate.
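The cross-chunk reuse described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `CrossChunkResidualCache`, the stand-in denoiser `toy_model`, and the fixed `refresh_every` schedule are all hypothetical, and the abstract does not say how C$^3$ache decides when a cached residual may be reused. The sketch only shows the general idea of storing one residual per denoising step and applying it again in a later chunk at that same step.

```python
from typing import Callable, Dict, List

Vector = List[float]


class CrossChunkResidualCache:
    """Hypothetical cache holding one residual per denoising step index."""

    def __init__(self, refresh_every: int = 2) -> None:
        # ASSUMPTION: residuals are recomputed on every `refresh_every`-th
        # chunk. The source does not state the actual reuse policy.
        self.refresh_every = refresh_every
        self.residuals: Dict[int, Vector] = {}
        self.full_calls = 0  # number of full (expensive) model evaluations

    def step(
        self,
        chunk_idx: int,
        step_idx: int,
        x: Vector,
        model: Callable[[Vector, int], Vector],
    ) -> Vector:
        must_compute = (
            chunk_idx % self.refresh_every == 0
            or step_idx not in self.residuals
        )
        if must_compute:
            out = model(x, step_idx)
            # Store the residual (output minus input) for this step index.
            self.residuals[step_idx] = [o - xi for o, xi in zip(out, x)]
            self.full_calls += 1
            return out
        # Reuse the residual cached by an earlier chunk at the same step.
        return [xi + r for xi, r in zip(x, self.residuals[step_idx])]


def toy_model(x: Vector, step_idx: int) -> Vector:
    # Stand-in for the expensive denoiser; purely illustrative.
    return [0.9 * xi + 0.1 * (step_idx + 1) for xi in x]


def run(num_chunks: int, num_steps: int, cache: CrossChunkResidualCache) -> List[Vector]:
    outputs = []
    for chunk_idx in range(num_chunks):
        # Each chunk starts from the same fixed state here, to mimic a
        # perfectly smooth behavior where consecutive chunks are alike.
        x = [1.0, -1.0]
        for step_idx in range(num_steps):
            x = cache.step(chunk_idx, step_idx, x, toy_model)
        outputs.append(x)
    return outputs


if __name__ == "__main__":
    cache = CrossChunkResidualCache(refresh_every=2)
    outputs = run(num_chunks=4, num_steps=3, cache=cache)
    print("full model calls:", cache.full_calls, "of", 4 * 3)
    print("final states per chunk:", outputs)
```

In this toy setting, chunks 0 and 2 run the model at every step and chunks 1 and 3 reuse the cached residuals, so 6 of the 12 denoising steps call the model. Because the chunks are identical by construction, reuse loses nothing here; in a real WAM the residuals are only correlated, so any reuse schedule trades speed against accuracy.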