GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference

2026-03-30

Machine Learning; Distributed, Parallel, and Cluster Computing
Keywords
GPU acceleration; Transformer models; Mixed-precision optimization; NVIDIA TensorRT; BERT; GPT-2; Latency; Numerical stability; Softmax; Layer normalization
Authors
Soutrik Mukherjee, Sangwhan Cha
Abstract
This paper presents the design and evaluation of a GPU-accelerated inference pipeline for transformer models using NVIDIA TensorRT with mixed-precision optimization. We evaluate BERT-base (110M parameters) and GPT-2 (124M parameters) across batch sizes from 1 to 32 and sequence lengths from 32 to 512. The system achieves up to 64.4x speedup over CPU baselines, sub-10 ms latency for single-sample inference, and a 63 percent reduction in memory usage. We introduce a hybrid precision strategy that preserves FP32 for numerically sensitive operations such as softmax and layer normalization, while applying FP16 to linear layers. This approach maintains high numerical fidelity (cosine similarity >= 0.9998 relative to baseline outputs) and eliminates NaN instability. The pipeline is implemented as a modular, containerized system that enables reproducible benchmarking across more than 360 configurations. Cross-GPU validation on an NVIDIA A100 shows consistent FP16 speedup ratios between 1.84x and 2.00x, along with stable numerical behavior. Downstream evaluation on SST-2 demonstrates no accuracy degradation under hybrid precision. Validation on WikiText-2 shows that random inputs underestimate NaN instability by up to 6x for full FP16, while confirming the robustness of the hybrid approach (0.0 percent NaN, cosine similarity >= 0.9998). These results provide a detailed characterization of performance and accuracy trade-offs across GPU architectures and offer practical guidance for deploying transformer models in latency-critical environments.
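The hybrid precision strategy described above, keeping FP32 for numerically sensitive operations (softmax, layer normalization) while running linear layers in FP16, can be illustrated with a minimal NumPy sketch. This is not the authors' TensorRT implementation; it simulates the precision split on CPU and checks the two fidelity metrics the abstract reports (NaN occurrence and cosine similarity against an FP32 baseline). All function names and shapes here are illustrative assumptions.

```python
import numpy as np

def linear_fp16(x, w, b):
    # Linear layers tolerate reduced precision well: cast to FP16,
    # compute, then return FP32 for the downstream sensitive ops.
    y = x.astype(np.float16) @ w.astype(np.float16) + b.astype(np.float16)
    return y.astype(np.float32)

def softmax_fp32(x):
    # Numerically sensitive: exp() can overflow in FP16 (max ~65504),
    # so keep this in FP32 with the usual max-subtraction trick.
    x = x.astype(np.float32)
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layernorm_fp32(x, eps=1e-5):
    # Mean/variance accumulation is also kept in FP32 to avoid
    # catastrophic cancellation at low precision.
    x = x.astype(np.float32)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = (rng.standard_normal((64, 64)) * 0.1).astype(np.float32)
b = np.zeros(64, dtype=np.float32)

# Hybrid path: FP16 linear, FP32 layernorm + softmax.
hybrid = softmax_fp32(layernorm_fp32(linear_fp16(x, w, b)))
# Baseline path: everything in FP32.
baseline = softmax_fp32(layernorm_fp32(x @ w + b))

cos = float(
    np.sum(hybrid * baseline)
    / (np.linalg.norm(hybrid) * np.linalg.norm(baseline))
)
has_nan = bool(np.isnan(hybrid).any())
print(f"NaNs: {has_nan}, cosine similarity: {cos:.6f}")
```

In a real deployment the same partitioning is expressed through the inference engine's precision controls (e.g., marking specific layers to run in FP32 while the rest of the network is built in FP16), rather than by manual casting as sketched here.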