Concordia: JIT-Compiled Persistent-Kernel Checkpointing for Fault-Tolerant LLM Inference
2026-06-22 • Distributed, Parallel, and Cluster Computing
Distributed, Parallel, and Cluster Computing • Machine Learning
AI summary
The authors address the problem of losing important data stored on GPUs during failures in large language model (LLM) systems. They created Concordia, a runtime that keeps important LLM state directly on the GPU and manages recovery without putting the host CPU on the critical path. Concordia adds checkpoints and recovery capabilities at a very low level, inside the GPU execution itself, to minimize lost work and speed up recovery. The system can detect and save only the changed parts of memory, making fault tolerance more efficient for continuous LLM inference.
large language model (LLM), GPU, checkpointing, fault tolerance, persistent kernel, PTX, SASS, JIT compilation, KV cache, delta checkpoint
Authors
Yuhang Gan, Yiwei Yang, Yuyi Li, Xiangyu Gao, Yichen Wang, Rain Jiang, Xiaoning Ding, Andi Quinn, Chen Qian
Abstract
Long-running LLM agents keep valuable state resident on GPUs: KV caches, request schedulers, communication state, and sometimes online adapters. Losing this state after a GPU or communicator failure can discard minutes to hours of work, yet existing recovery mechanisms either restart the whole serving stack or require application-specific checkpoint logic inside every attention and runtime component. This paper argues that fault tolerance for such workloads needs a GPU-resident execution context: checkpoint hooks must run at device synchronization points, observe binary kernels that frameworks and libraries actually execute, and recover without putting the host CPU on the critical path. We present Concordia, a runtime that uses a device-resident persistent kernel as the substrate for fault-tolerant LLM inference. Concordia interposes on GPU module loading and supports PTX- and SASS-level instrumentation, allowing checkpoint and pause hooks to be inserted below framework code and library boundaries. For each registered LLM state region, Concordia JIT-compiles a specialized delta-checkpoint handler -- for example, a KV-block scanner, adapter-page scanner, or recovery applier -- and hot-swaps it into the persistent kernel's operator table. The persistent kernel consumes a lock-free ring buffer of compute, checkpoint, append-log, and recovery tasks, so the same always-on executor triggers dirty-page detection, stages deltas, and appends committed records to a CPU-visible log in CXL memory or host DRAM.
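The delta-checkpoint flow described in the abstract (dirty-page detection, an append-only log of committed records, and recovery by replay) can be illustrated with a minimal host-side sketch. This is not Concordia's implementation: the names `DeltaCheckpointer`, `page_digests`, `run_tasks`, and `PAGE_SIZE` are hypothetical, a `bytearray` stands in for a registered GPU state region, digest comparison stands in for whatever dirty-page detection the system actually uses, and a single-threaded `deque` stands in for the lock-free ring buffer and operator table.

```python
import hashlib
from collections import deque

PAGE_SIZE = 4  # bytes per page; deliberately tiny for illustration (assumption)


def page_digests(buf, page_size=PAGE_SIZE):
    """Return one short digest per fixed-size page of buf."""
    return [
        hashlib.blake2b(bytes(buf[i:i + page_size]), digest_size=8).digest()
        for i in range(0, len(buf), page_size)
    ]


class DeltaCheckpointer:
    """Tracks one fixed-size state region and logs only the pages that changed."""

    def __init__(self, region, page_size=PAGE_SIZE):
        self.region = region              # bytearray standing in for device state
        self.page_size = page_size
        self.base = bytes(region)         # full snapshot taken at registration
        self.digests = page_digests(region, page_size)
        self.log = []                     # append-only list of delta records

    def checkpoint(self):
        """Scan for dirty pages, append one record if any, return the dirty count."""
        current = page_digests(self.region, self.page_size)
        dirty = {}
        for idx, (old, new) in enumerate(zip(self.digests, current)):
            if old != new:
                start = idx * self.page_size
                dirty[idx] = bytes(self.region[start:start + self.page_size])
        if dirty:
            self.log.append(dirty)        # commit the delta record
        self.digests = current
        return len(dirty)

    def recover(self):
        """Rebuild the region from the base snapshot plus all logged deltas, in order."""
        out = bytearray(self.base)
        for record in self.log:
            for idx, data in record.items():
                start = idx * self.page_size
                out[start:start + len(data)] = data
        return out


def run_tasks(tasks, operators):
    """Drain a queue of (op_name, arg) tasks through an operator table (a dict)."""
    results = []
    while tasks:
        name, arg = tasks.popleft()
        results.append(operators[name](arg))
    return results
```

In this sketch, replacing an entry in the `operators` dict plays the role of hot-swapping a handler into the operator table: the executor loop stays the same while the handler for a given task type changes.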