GPU Parallelization Strategies for Forward and Backward Propagation in Shallow Neural Networks: A CUDA-Based Comparative Study

2026-06-29 • Distributed, Parallel, and Cluster Computing

Distributed, Parallel, and Cluster Computing • Machine Learning

AI summary

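The authors studied ways to speed up forward and backward propagation in a shallow neural network on NVIDIA GPUs using CUDA. They tested three stacked optimizations: tiling data in shared memory with padding to avoid bank conflicts, pre-transposing weight matrices so that global-memory reads are coalesced, and fusing matrix multiplication with ReLU so that intermediate results are never written to global memory. The fully optimized version ran about 1.41 times faster than the baseline CUDA implementation on the large dataset (25,600 samples). They also compared the results against sequential CPU and OpenMP implementations, and conclude that memory-access optimization is effective for GPU-accelerated deep learning primitives.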

CUDA, GPU optimization, shared memory, memory coalescing, matrix multiplication, ReLU activation, deep learning, Tesla T4, parallel computing, OpenMP

Authors

Rania Zitouni, Nadine Bousdjira, Sarah Hasnaoui, Amel Sadoun, Fatma Salhi

Abstract

We present a comparative study of CUDA optimization strategies applied to forward and backward propagation in a shallow neural network. Three stacked optimizations are evaluated: (1) tiled shared memory with bank-conflict elimination via +1-column padding, (2) pre-transposed weight matrices for coalesced global memory access, and (3) a fused MatMul+ReLU kernel that eliminates intermediate global-memory round-trips. Experiments on an NVIDIA Tesla T4 (CUDA 13.0) across three dataset sizes show that the fully optimized implementation achieves a 1.41x speedup over the baseline CUDA version on the large dataset (25,600 samples), reducing execution time from 21.0s to 14.8s. Results are compared against a sequential CPU baseline and an OpenMP parallel implementation, demonstrating the effectiveness of memory-access optimization in GPU-accelerated deep learning primitives.
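
The three optimizations named in the abstract can be illustrated with a minimal CUDA sketch. This is not the authors' code: the kernel name fused_matmul_relu, the helper max_abs_error, the tile size of 16, and the row-major layout (X is M x K, the pre-transposed weights Wt are K x N) are all assumptions chosen for illustration. The sketch covers only a single forward layer and checks the GPU result against a plain CPU loop.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile size, not taken from the paper

// Computes Y[M x N] = ReLU(X[M x K] * Wt[K x N]); all matrices are row-major.
// Wt is assumed to be the weight matrix transposed once on the host, so that
// neighbouring threads (consecutive col) read consecutive addresses of Wt.
__global__ void fused_matmul_relu(const float* X, const float* Wt, float* Y,
                                  int M, int K, int N) {
    // The +1 column changes the row stride from 16 to 17 words, so accesses
    // that walk down a column of a tile do not all map to the same bank.
    __shared__ float xs[TILE][TILE + 1];
    __shared__ float ws[TILE][TILE + 1];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int xk = t * TILE + threadIdx.x;  // column of X loaded by this thread
        int wk = t * TILE + threadIdx.y;  // row of Wt loaded by this thread
        xs[threadIdx.y][threadIdx.x] = (row < M && xk < K) ? X[row * K + xk] : 0.0f;
        ws[threadIdx.y][threadIdx.x] = (wk < K && col < N) ? Wt[wk * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += xs[threadIdx.y][k] * ws[k][threadIdx.x];
        __syncthreads();
    }
    // Fusion: ReLU is applied to the accumulator in registers, so the
    // pre-activation values are never written to global memory.
    if (row < M && col < N) Y[row * N + col] = fmaxf(acc, 0.0f);
}

// Hypothetical helper: runs the kernel on synthetic data and returns the
// largest absolute difference from a sequential CPU reference.
float max_abs_error(int M, int K, int N) {
    std::vector<float> X(M * K), Wt(K * N), Y(M * N);
    for (int i = 0; i < M * K; ++i) X[i] = (i % 7) * 0.25f - 0.75f;
    for (int i = 0; i < K * N; ++i) Wt[i] = (i % 5) * 0.5f - 1.0f;

    float *dX, *dW, *dY;
    cudaMalloc(&dX, X.size() * sizeof(float));
    cudaMalloc(&dW, Wt.size() * sizeof(float));
    cudaMalloc(&dY, Y.size() * sizeof(float));
    cudaMemcpy(dX, X.data(), X.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dW, Wt.data(), Wt.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    fused_matmul_relu<<<grid, block>>>(dX, dW, dY, M, K, N);
    cudaMemcpy(Y.data(), dY, Y.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dX); cudaFree(dW); cudaFree(dY);

    float err = 0.0f;
    for (int r = 0; r < M; ++r)
        for (int c = 0; c < N; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += X[r * K + k] * Wt[k * N + c];
            err = fmaxf(err, fabsf(fmaxf(ref, 0.0f) - Y[r * N + c]));
        }
    return err;
}

int main() {
    // Sizes are deliberately not multiples of TILE to exercise the bounds checks.
    printf("max abs error vs CPU: %g\n", max_abs_error(100, 70, 50));
    return 0;
}
```

In this particular access pattern the padding is mostly illustrative, since each warp reads the tiles row-wise or by broadcast; it pays off when a kernel reads a tile column-wise, for example when transposing through shared memory. The sketch should build with nvcc (for example, nvcc sketch.cu -o sketch, where the file name is arbitrary) and needs a CUDA-capable GPU to run.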
