HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
2026-04-20 • Performance • Distributed, Parallel, and Cluster Computing
AI summary
The authors explore how to efficiently run large language models (LLMs) that need to remember and process very long text sequences, which causes huge memory demands. They propose a method called HybridGen that lets both the CPU and GPU work together using a special type of expanded memory to handle these large memory needs better. Their approach solves key problems related to balancing workload and managing memory layout, resulting in faster performance compared to other existing methods while keeping the model's accuracy intact. They test HybridGen on various models and hardware setups and find it consistently outperforms prior solutions.
Large Language Models (LLMs), Key-Value (KV) Cache, CPU-GPU Collaboration, Expanded Tiered Memory, CXL Memory, Multi-dimensional Attention, NUMA (Non-Uniform Memory Access), Attention Logit Parallelism, Scheduler, KV Cache Mapping
Authors
Mao Lin, Xi Wang, Guilherme Cox, Dong Li, Hyeran Jeon
Abstract
As modern LLMs support thousands to millions of tokens, KV caches grow to hundreds of gigabytes, stressing memory capacity and bandwidth. Existing solutions, such as KV cache pruning and offloading, alleviate this pressure but underutilize hardware: they rely solely on either the GPU or the CPU for attention computation, and they consider only the limited CPU-local memory for KV cache storage. We propose HybridGen, an efficient hybrid attention framework for long-context LLM inference. HybridGen enables CPU-GPU collaborative attention on systems with expanded tiered memory (e.g., CXL memory), addressing three key challenges: (1) multi-dimensional attention dependencies, (2) CPU-GPU load imbalance that intensifies with longer sequences, and (3) the NUMA penalty of tiered memories. HybridGen tackles these with attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping, respectively. Experiments with three LLM families in eleven model sizes on three GPU platforms with CXL-expanded memory show that HybridGen outperforms six state-of-the-art KV cache management methods by 1.41x--3.2x on average while maintaining superior accuracy.
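The abstract does not detail how attention logit parallelism works, but splitting one attention computation across devices generally requires merging partial softmax results exactly. A minimal NumPy sketch of that merge step, under the assumption that each device computes attention over its own KV partition and the partials are combined with log-sum-exp rescaling (the function names and the two-way GPU/CPU split are illustrative, not HybridGen's actual implementation):

```python
import numpy as np

def partial_attention(q, K, V):
    # Attention over one KV partition (e.g., the GPU-resident shard).
    logits = K @ q                      # (n,) attention logits
    m = logits.max()                    # partition-local max, for stability
    w = np.exp(logits - m)              # unnormalized softmax weights
    return w @ V, m, w.sum()            # unnormalized output, max, normalizer

def merge(parts):
    # Rescale each partial result to a global max so the combined
    # softmax is exact, as if computed over the whole sequence.
    m_glob = max(m for _, m, _ in parts)
    s_glob = sum(s * np.exp(m - m_glob) for _, m, s in parts)
    o_glob = sum(o * np.exp(m - m_glob) for o, m, _ in parts)
    return o_glob / s_glob

rng = np.random.default_rng(0)
d, n = 8, 32
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Reference: full softmax attention computed on one device.
logits = K @ q
w = np.exp(logits - logits.max())
ref = (w / w.sum()) @ V

# Hybrid: "GPU" handles the first half of the KV cache, "CPU" the rest.
out = merge([partial_attention(q, K[:16], V[:16]),
             partial_attention(q, K[16:], V[16:])])
assert np.allclose(out, ref)
```

Because the merge is exact rather than approximate, this style of partitioning preserves model accuracy while letting the two devices work on disjoint KV shards concurrently, which is consistent with the accuracy claim in the abstract.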