Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

2026-06-18 • Machine Learning

Machine Learning • Distributed, Parallel, and Cluster Computing

AI summary

The author explores a way to quickly save and restore the full state of AI models running on devices such as robots or speech systems, which must respond fast and often branch, reset, or get interrupted. Instead of saving only the key-value (KV) cache, the approach uses 'execution-state capsules' that capture the complete restorable state at a committed boundary, including KV, recurrent, convolution, and MTP state plus metadata, so the model can be snapshotted, restored, forked, or rolled back. The runtime, FlashRT, runs captured graph plans over contiguous static buffers on NVIDIA GPUs; GPU-resident snapshot and restore take under a millisecond, and time-to-first-token speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens on an RTX 5090. The method targets low-latency, small-batch serving and is presented as a complement to high-throughput KV-cache serving, not a replacement.

LLM serving, key-value cache, execution state, checkpoint and restore, GPU runtime, NVIDIA CUDA, low-latency serving, snapshot, FlashRT, recurrent state

Authors

Liang Su

Abstract

Mainstream LLM serving systems reuse prefix work mainly through paged or radix key-value (KV) caches. This is highly effective for high-throughput, high-concurrency serving, but it manages only one positional fragment of execution state: the KV cache. We study the opposite regime: low-latency, small-batch, on-device physical-AI serving, where interactive LLM agents, speech systems, and robot policies repeatedly branch, reset, interrupt, and re-enter under tight responsiveness budgets. We introduce execution-state capsules, a graph-bound checkpoint and restore mechanism for the complete restorable state at a committed boundary. FlashRT is a white-box, backend-facing kernel runtime whose evaluated NVIDIA CUDA backend runs captured graph plans over contiguous static buffers with no block-table indirection. Because the live state is a closed set of named buffers, a capsule can snapshot, restore, fork, or roll back the whole execution boundary, including KV, recurrent state, convolution state, MTP state, and metadata. This moves reuse from token-addressed KV fragments to graph-bound execution-state boundaries. On an RTX 5090, capsule restore is byte-exact at the stored-state level and token-identical under greedy decode. A KV-only ablation diverges, showing that recurrent state is load-bearing. GPU-resident snapshot and restore are sub-millisecond, and TTFT speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens. On Jetson AGX Thor and DGX Spark, the same correctness and structural properties hold. Capsules are not a replacement for high-throughput KV-cache serving; they define a complementary latency-first serving point for explicit execution-state reuse.
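The capsule idea in the abstract (live state as a closed set of named buffers that can be snapshotted and restored as a whole) can be illustrated with a toy sketch. This is not FlashRT's API: `ToyRuntime`, `Capsule`, the buffer names, and the update rule in `step` are all hypothetical, and host-side `bytearray` objects stand in for static GPU buffers. The sketch only shows why restoring KV alone can diverge when a recurrent buffer also carries history.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Capsule:
    """Immutable copy of named buffers at a committed boundary (hypothetical type)."""
    buffers: Dict[str, bytes]


class ToyRuntime:
    """Toy stand-in for a runtime whose live state is a closed set of named buffers."""

    def __init__(self) -> None:
        # Fixed-size buffers allocated once; names and sizes are illustrative only.
        self.state: Dict[str, bytearray] = {
            "kv": bytearray(16),        # one slot per position
            "recurrent": bytearray(1),  # running summary of all tokens seen
            "meta": bytearray(1),       # meta[0] holds the current position
        }

    def step(self, token: int) -> int:
        """Consume one token, update every buffer, return a toy 'next token'."""
        pos = self.state["meta"][0]
        self.state["kv"][pos] = token
        # The recurrent buffer carries history that a KV-only restore leaves untouched.
        self.state["recurrent"][0] = (self.state["recurrent"][0] * 31 + token) % 251
        self.state["meta"][0] = pos + 1
        return (self.state["recurrent"][0] + token) % 256

    def snapshot(self, names: Optional[Iterable[str]] = None) -> Capsule:
        """Copy the chosen buffers; by default, the whole closed set."""
        chosen = list(self.state) if names is None else list(names)
        return Capsule({n: bytes(self.state[n]) for n in chosen})

    def restore(self, capsule: Capsule) -> None:
        """Copy bytes back in place so each buffer object keeps its identity."""
        for name, data in capsule.buffers.items():
            self.state[name][:] = data


def run(rt: ToyRuntime, tokens: List[int]) -> List[int]:
    return [rt.step(t) for t in tokens]


rt = ToyRuntime()
run(rt, [3, 5, 7])                          # shared prefix up to the boundary
full = rt.snapshot()                        # capsule: every named buffer
kv_only = rt.snapshot(names=["kv", "meta"]) # ablation: recurrent state omitted
reference = run(rt, [11, 13])               # continuation from the boundary

rt.restore(full)
byte_exact = rt.snapshot() == full          # stored state matches byte for byte
replay_full = run(rt, [11, 13])

rt.restore(kv_only)                         # recurrent buffer keeps stale contents
replay_kv_only = run(rt, [11, 13])

print(byte_exact, replay_full == reference, replay_kv_only == reference)
```

In this toy, the full capsule reproduces the reference continuation while the KV-only restore does not, because the recurrent buffer still holds values from the abandoned branch. A fork would amount to restoring the same capsule into a second runtime instance.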
