ACEAPEX: Parallel LZ77 Decoding via Encode-Time Absolute Offset Resolution

2026-06-02
Distributed, Parallel, and Cluster Computing

AI summary

The authors identify a bottleneck in conventional LZ77 decoding: each back-reference depends on data that has already been decompressed, so decoding proceeds sequentially and does not scale across cores. They introduce ACEAPEX, a codec that splits data into self-contained 1 MB blocks and stores back-references as absolute positions in the decompressed output, so that blocks can be decoded in parallel. They also implement a GPU wavefront decoder on NVIDIA H100, reporting 44.0 GB/s on enwik9 with byte-for-byte verified output, and 77.2 GB/s on a single H100 with a depth-limited encoder variant (-1.5% ratio on enwik9). To the authors' knowledge, this is the first reported GPU LZ77 decoder with a near-standard compression ratio and byte-for-byte verified output.

LZ77 compression, back-reference, parallel decoding, absolute positions, block-level decoding, GPU decoding, wavefront decoder, compression ratio, lzbench, FASTQ genomic data
Authors
Yakiv Shavidze
Abstract
LZ77-based codecs exhibit a fundamental sequential bottleneck in decoding: each back-reference depends on previously decompressed data, preventing multi-core scaling. We present ACEAPEX, a parallel LZ77 codec that stores all back-references as absolute positions in the decompressed output and organizes data into self-contained 1 MB blocks, enabling embarrassingly parallel block-level decoding. Integrated into lzbench, ACEAPEX achieves 10,160 MB/s on EPYC 4344P (8 cores) and 10,869 MB/s on EPYC 9575F for FASTQ genomic data, up to 3.1x faster than zstd -3 at comparable compression ratios. We further implement a GPU wavefront decoder on NVIDIA H100 SXM, measuring 44.0 GB/s on enwik9 and 20.3 GB/s on FASTQ (wavefront match phase, verified bit-perfect). With a depth-limited encoder variant (-1.5% ratio on enwik9), GPU throughput reaches 77.2 GB/s on a single H100 and 249.9 GB/s on two H100s in an NVLink configuration. To our knowledge, this is the first reported GPU LZ77 decoder with near-standard compression ratio verified byte-for-byte.
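
The block-level idea in the abstract can be illustrated with a minimal Python sketch. It is not the ACEAPEX format or implementation: the token layout (`("lit", bytes)` and `("match", abs_src, length)`), the names `decode_block` and `decode_parallel`, and the rule that a match may only point inside its own block are all assumptions made for illustration, the last one inferred from the phrase "self-contained". Blocks here are a few bytes long rather than 1 MB. Because every match names an absolute position in the output, each block can be decoded directly into its slice of a shared buffer without waiting for other blocks.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_block(out, block_start, tokens):
    """Decode one block into `out`, beginning at absolute offset block_start.

    Hypothetical token format (an assumption, not the ACEAPEX bitstream):
      ("lit", data)              -> copy literal bytes
      ("match", abs_src, length) -> copy `length` bytes from absolute
                                    output position `abs_src`
    """
    pos = block_start
    for tok in tokens:
        if tok[0] == "lit":
            data = tok[1]
            out[pos:pos + len(data)] = data
            pos += len(data)
        else:
            _, src, length = tok
            # Assumed rule: a match only references bytes already written
            # by this same block, which is what makes blocks independent.
            assert block_start <= src < pos
            # Byte-wise copy so overlapping matches (src + length > pos) work.
            for i in range(length):
                out[pos + i] = out[src + i]
            pos += length
    return pos - block_start  # number of bytes this block produced


def decode_parallel(blocks, total_len, workers=4):
    """Decode all blocks concurrently; each writes a disjoint output range."""
    out = bytearray(total_len)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(decode_block, out, start, toks)
                   for start, toks in blocks]
        for f in futures:
            f.result()  # re-raise any decoding error
    return bytes(out)


# Two tiny blocks; each entry is (absolute start offset, token list).
blocks = [
    (0, [("lit", b"abc"), ("match", 0, 6), ("lit", b"!")]),
    (10, [("lit", b"xy"), ("match", 10, 4)]),
]
result = decode_parallel(blocks, total_len=16)
```

In this sketch the second block's match uses source position 10, an absolute output offset rather than a distance back from the current position, so a worker needs only the block's start offset and its own tokens. Python threads do not give real multi-core speedup for this loop; the thread pool only shows that the blocks have no ordering dependency.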