PQuantML: A Tool for End-to-End Hardware-aware Model Compression

2026-03-27
Machine Learning
AI summary

The authors present PQuantML, an open-source software tool that makes neural networks smaller and faster for devices with strict speed limits. It combines methods that trim unnecessary parts of a model and reduce the numerical precision it uses, applied either separately or together. They tested it on a physics task involving real-time data processing and found that PQuantML can shrink models substantially without losing accuracy. The authors also compared it against other tools and showed its effectiveness.

Neural network compression · Model pruning · Quantization · Fixed-point quantization · High-Granularity Quantization · Jet tagging · Latency constraints · Edge computing · Open-source library
Authors
Roope Niemi, Anastasiia Petrovych, Arghya Ranjan Das, Enrico Lupi, Chang Sun, Dimitrios Danopoulos, Marlon Joshua Helbing, Mia Liu, Sebastian Dittmeier, Michael Kagan, Vladimir Loncar, Maurizio Pierini
Abstract
PQuantML is a new open-source, hardware-aware neural-network model compression library tailored to end-to-end workflows. Motivated by the need to deploy performant models in environments with strict latency constraints, PQuantML simplifies the training of compressed models by providing a unified interface for applying pruning and quantization, either jointly or individually. The library implements multiple pruning methods at different granularities, as well as fixed-point quantization with support for High-Granularity Quantization. We evaluate PQuantML on a representative task, jet substructure classification (so-called jet tagging), an edge-computing problem arising in real-time LHC data processing. Using various pruning methods combined with fixed-point quantization, PQuantML achieves substantial parameter and bit-width reductions while maintaining accuracy. The resulting compression is further compared against existing tools such as QKeras and HGQ.
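To make the two compression steps in the abstract concrete, the sketch below illustrates magnitude-based pruning and fixed-point quantization on a weight array. This is a minimal conceptual example, not PQuantML's actual API: the function names and the bit-width convention (total bits with a sign bit and a configurable number of integer bits, as in `ap_fixed`-style formats) are assumptions for illustration only.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out roughly the smallest-magnitude fraction `sparsity` of weights.

    Ties at the threshold magnitude are also zeroed, so the achieved
    sparsity can slightly exceed the requested one.
    """
    k = int(np.floor(sparsity * w.size))
    if k == 0:
        return w.copy()
    thresh = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) <= thresh, 0.0, w)

def fixed_point_quantize(w, total_bits=8, int_bits=1):
    """Round weights to a signed fixed-point grid.

    `total_bits` includes the sign bit; `int_bits` counts integer bits
    (conventions differ between libraries, so this is one common choice).
    """
    frac_bits = total_bits - int_bits  # remaining bits hold the fraction
    scale = 2.0 ** frac_bits
    lo, hi = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return np.clip(np.round(w * scale), lo, hi) / scale

# Example: prune half the weights, then quantize the survivors to 8 bits.
w = np.array([0.1, -0.05, 0.7, -0.8])
pruned = magnitude_prune(w, sparsity=0.5)
quantized = fixed_point_quantize(pruned, total_bits=8, int_bits=1)
print(quantized)
```

Applying the two transforms jointly during training (rather than post hoc, as here) is what a library like PQuantML automates; this snippet only shows the arithmetic each step performs on a single tensor.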