Beyond CPU-GPU Frequency: Memory-Clock and Tail Effects in Edge Inference Latency Estimation

2026-06-15 • Performance

Performance • Hardware Architecture • Distributed, Parallel, and Cluster Computing

AI summary

The authors studied how changing CPU, GPU, and memory clock frequencies affects machine learning inference latency on an NVIDIA Jetson Orin Nano. They found that the memory clock shifts median latency by 11% to 48% depending on workload across the upper EMC range, yet estimators that model only CPU and GPU frequency ignore it. Deadline misses also cluster in bursts far more often than independence would predict, so an aggregate miss rate understates how hard deadlines are to meet. Frequency switching is not instantaneous either: a new operating point takes 1 to 8 ms to take effect, which distorts timing predictions for per-inference governors. These findings suggest that latency estimators should account for memory-clock effects, tail behavior, and transition delays.

DVFS • latency estimation • edge ML inference • NVIDIA Jetson Orin • memory clock • miss rate • frequency scaling • GPU • CPU • Generalized Pareto Distribution

Authors

Jaehoon Kang

Abstract

Frequency-aware latency estimators enable deadline-aware DVFS for edge ML inference by modeling latency over CPU and GPU frequencies. We present a measurement study on an NVIDIA Jetson Orin Nano showing three phenomena outside this modeling scope. (1) The memory clock is a missing axis: across the realistic upper EMC range (2133->3199 MHz) it shifts median latency by +11% to +48% depending on workload, and for a synthetic L2-resident kernel at the top GPU clock we observe a reproducible non-monotonic case (-9%). A GPU-frequency estimator profiled under one power profile and deployed under another consequently underestimates latency by up to 32%; tabulating the four lockable EMC points repairs most workloads, while a parametric 1/f_emc term does not. (2) Aggregate miss rates hide bursts: at fixed clocks, 100k-cycle runs show knife-edge distributions whose deadline-miss cliffs span ~1 ms, yet misses cluster far beyond independence - at a 0.1% aggregate miss rate, the next cycle also misses with probability up to 74% (740x the independent baseline). Gaussian mu+3sigma margins overshoot a 0.1% miss target by 13x-29x, while out-of-sample generalized Pareto margins stay within ~2x of it across all eight configurations. (3) Frequency actuation is not free: per-domain transition stalls stay below 100 us, but the new operating point takes 1/5/8 ms (CPU/GPU/EMC) to take effect - a substantial fraction of typical inference periods for per-inference governors. We release the full measurement harness and discuss implications for the next generation of frequency-aware estimators and governors.
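
The burstiness claim in (2) compares two quantities: the aggregate deadline-miss rate and the probability that a miss is immediately followed by another miss. The following is a minimal Python sketch of that comparison, not the released measurement harness; the function name `miss_burst_stats`, its return keys, and the synthetic trace are all hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, Sequence


def miss_burst_stats(latencies_ms: Sequence[float], deadline_ms: float) -> Dict[str, float]:
    """Compare the aggregate miss rate with P(miss at i+1 | miss at i).

    Hypothetical helper: under independent misses the conditional
    probability would equal the aggregate rate, so their ratio
    measures how strongly misses cluster.
    """
    misses = [lat > deadline_ms for lat in latencies_ms]
    if len(misses) < 2:
        raise ValueError("need at least two cycles")

    aggregate = sum(misses) / len(misses)

    # Count cycles that missed and have a successor, and how many of
    # those successors also missed.
    prev_missed = sum(misses[:-1])
    both_missed = sum(1 for a, b in zip(misses, misses[1:]) if a and b)

    if prev_missed == 0:
        conditional = math.nan
        ratio = math.nan
    else:
        conditional = both_missed / prev_missed
        ratio = conditional / aggregate

    return {
        "aggregate_miss_rate": aggregate,
        "next_miss_given_miss": conditional,
        "ratio_to_independent": ratio,
    }


# Synthetic trace (assumed values): 1000 cycles at 10 ms with one
# burst of four consecutive 12 ms cycles, checked against an 11 ms deadline.
trace = [10.0] * 1000
for i in range(100, 104):
    trace[i] = 12.0

stats = miss_burst_stats(trace, deadline_ms=11.0)
print(stats)
```

On this synthetic trace the aggregate miss rate is 0.4%, while three of the four misses are followed by another miss, giving a conditional probability of 0.75, roughly 187x the independent baseline. The abstract's 740x figure has the same form: 74% divided by a 0.1% aggregate rate.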
