COSM: A Cooperative Scheduling Framework for Concurrent PIM and CPU Execution on Mobile Devices

2026-06-29 • Hardware Architecture

Hardware Architecture • Distributed, Parallel, and Cluster Computing

AI summary

The authors designed COSM, a system that helps a phone's CPU and a special memory chip called Processing-in-Memory (PIM) work together smoothly while running large language models (LLMs). Their approach reduces delays and energy use by scheduling PIM operations during CPU idle times, avoiding conflicts in memory access. Tests show COSM significantly speeds up PIM tasks with almost no slowdown for the CPU. This helps make on-device language models more efficient and private.

Processing-in-Memory • large language models • CPU scheduling • memory bandwidth • DRAM • bank conflicts • latency hiding • mobile computing • throughput

Authors

Yilong Zhao, Fangxin Liu, Onur Mutlu, Mingyu Gao, Jian Liu, Haibing Guan, Li Jiang

Abstract

The development of on-device large language models (LLMs) is driven by the need for privacy and fast response times. Because data transfer on mobile devices is energy-intensive, Processing-in-Memory (PIM) is an effective solution. Due to stringent DRAM cost constraints, limited physical footprint on circuit boards, and the interaction between applications and LLMs, the CPU and PIM must operate concurrently within a shared memory space. However, concurrent operation can cause bank conflicts and bus congestion, potentially diminishing the performance and energy benefits of PIM. To address these challenges, we introduce COSM, a cooperative scheduling framework designed to facilitate the concurrent operation of PIM and CPU tasks on mobile platforms. Our key innovations are: 1) a low-interference PIM control interface that generates the maximum number of PIM commands without disrupting CPU memory accesses; and 2) an idleness-aware scheduling method that integrates PIM commands into available idle time windows within the CPU's access sequence. COSM not only hides PIM execution latency from the CPU, but also overlaps PIM execution with data transfer. Experiments on concurrent execution of LLMs and mobile workloads, including mobile applications and compute-intensive kernels, demonstrate that COSM improves PIM throughput by up to 2.8x compared to the baseline scheduling method, with less than 2.0% CPU performance loss.
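
To make the idea of idleness-aware scheduling concrete, the following is a minimal illustrative sketch, not COSM's actual algorithm, which the abstract does not specify. It assumes a simplified model in which CPU memory accesses are given as busy intervals in cycles and every PIM command takes a fixed number of cycles; the function names `find_idle_windows` and `place_pim_commands` and all parameters are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in cycles


def find_idle_windows(cpu_busy: List[Interval], horizon: int) -> List[Interval]:
    """Return the gaps in [0, horizon) not covered by any CPU busy interval."""
    windows: List[Interval] = []
    cursor = 0  # earliest cycle not yet known to be busy
    for start, end in sorted(cpu_busy):
        if start > cursor:
            windows.append((cursor, min(start, horizon)))
        cursor = max(cursor, end)
        if cursor >= horizon:
            break
    if cursor < horizon:
        windows.append((cursor, horizon))
    return windows


def place_pim_commands(
    windows: List[Interval], pim_cycles: int, num_cmds: int
) -> List[int]:
    """Greedily pack fixed-length PIM commands into idle windows.

    Returns the start cycle of each placed command. A command is placed
    only if it finishes before the window closes, so it never overlaps
    a CPU access in this simplified model.
    """
    starts: List[int] = []
    for start, end in windows:
        t = start
        while t + pim_cycles <= end and len(starts) < num_cmds:
            starts.append(t)
            t += pim_cycles
    return starts


if __name__ == "__main__":
    # Hypothetical CPU access trace: busy intervals in cycles.
    cpu_busy = [(0, 4), (10, 12), (20, 25)]
    idle = find_idle_windows(cpu_busy, horizon=30)
    print(idle)
    print(place_pim_commands(idle, pim_cycles=3, num_cmds=10))
```

In this toy model the CPU trace is known in advance; a real memory controller would have to predict or detect idle windows online and account for bank state and DRAM timing constraints, which the sketch ignores.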
