From the NYU Ultracomputer to Modern Exascale: A Historical and Architectural Survey of In-Network Computing and Scalable Synchronization

2026-06-15 · Distributed, Parallel, and Cluster Computing

Distributed, Parallel, and Cluster Computing; Hardware Architecture; General Literature
AI summary

The authors review the development of hardware designs and communication methods used in massively parallel computers over the last 40 years. They study early systems like the NYU Ultracomputer and IBM RP3, which combined Fetch-and-Add operations in network hardware, and compare these to later distributed-memory systems like the IBM SP series and to modern in-network computing such as NVIDIA SHARP and HPE Slingshot. They also analyze how synchronization works in message-passing systems and how modern deep learning software runs on current hardware. Additionally, the authors present a historical case study on the feasibility of building combining switches from Inmos Transputers programmed in Occam, and trace the evolution of software synchronization from Dimitrovsky's "group lock" to group mutual exclusion, highlighting the divide between American systems engineering and European formal methods.

massively parallel systems, Fetch-and-Add, multistage interconnection networks, message-passing synchronization, MPI, remote memory access (RMA), deep learning hardware, Inmos Transputer, group mutual exclusion, formal methods
Authors
Lars Warren Ericson
Abstract
This paper presents a historical and technical survey of the hardware architectures, interconnection networks, and synchronization primitives that have shaped massively parallel systems over the past four decades. We examine the design of the NYU Ultracomputer and the IBM Research Parallel Processor Prototype (RP3), focusing on the hardware implementation of the Fetch-and-Add primitive in multistage interconnection networks. We contrast these early attempts at fine-grained, shared-memory hardware combining with the distributed-memory architectures of the IBM SP series and the modern in-network computation models found in NVIDIA SHARP and HPE Slingshot. We provide a technical analysis of message-passing synchronization, presenting a complete profile of MPI operation frequencies and detailing the low-level hardware mapping of one-sided RMA atomics to PCIe Atomics and GPU caches. We investigate the software-hardware boundary in modern deep learning, detailing how HIP translation, Triton compilation, and 4-bit quantization (W4A16) execute on modern heterogeneous silicon. To evaluate alternative network node designs, we present a historical hardware case study analyzing the feasibility of implementing active combining switches using message-passing Inmos Transputers programmed in Occam. Finally, we contextualize the evolution of concurrent software synchronization by examining Isaac Dimitrovsky's parallel "group lock" primitive, tracing its downstream echoes in group mutual exclusion (GME) and room synchronization, and we reflect on the historical and philosophical divide between American systems engineering and European formal methods.