NeuronFabric: A Software Reference Architecture for On-Chip Transformer Training with Local Adam
2026-06-15 • Hardware Architecture
Hardware Architecture • Artificial Intelligence • Machine Learning
AI summary
The author introduces NeuronFabric, a software reference architecture for training transformer models with Adam updates kept local to the chip, aimed at future FPGA and ASIC implementations. A C# prototype, written without external machine-learning frameworks, trains a small transformer on the Shakespeare corpus to check numerical correctness and memory use. A key idea is BF16W, which stores weights in BF16 while keeping the Adam moments in FP32. The resulting footprint of about 3.34 MB fits within the on-chip BRAM of a Xilinx ZCU102, although no FPGA measurements are reported yet. The paper shares the architecture and code as a foundation for future hardware training research.
transformer models • Adam optimizer • FPGA • ASIC • backpropagation • BF16 • mixed precision • memory optimization • autoregressive model • Shakespeare dataset
Authors
Evgeny Ukladchikov
Abstract
Publicly documented accelerator architectures generally separate training computation from optimizer-state updates or rely on external memory and host orchestration. This paper presents NeuronFabric, a software reference architecture intended for future FPGA and ASIC implementations of transformer training with local Adam updates. A complete C# prototype implements the forward pass, backpropagation, and Adam optimization without external machine-learning frameworks; its purpose is to validate numerical correctness and memory requirements before hardware implementation. The paper introduces BF16W, a configuration that stores weights in BF16 while retaining the Adam optimizer moments in FP32, which reduces the memory required for on-chip training. The evaluated model is a 334K-parameter autoregressive transformer (d=88, H=4, f=264, L=4, vocab=256) trained on the Shakespeare corpus. The BF16W configuration reaches an evaluation loss of 1.5426 after 80K samples, compared with 1.5224 for an FP32 GPU reference, while producing coherent character-level text. A 334K-parameter FP32 model with Adam moments requires approximately 4.0 MB, matching the BRAM capacity of a Xilinx ZCU102 device; the BF16W variant requires approximately 3.34 MB, leaving memory available for activation storage. We describe the vocabulary-budget constraint observed during earlier experiments, quantify the BF16W memory savings, and outline FPGA training as the next stage of development. No FPGA measurements are included in this paper. This publication serves as a public architectural disclosure and software reference implementation for future FPGA and ASIC exploration of the NeuronFabric architecture.
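The memory figures quoted in the abstract can be reproduced with simple per-parameter arithmetic, and the BF16W idea can be illustrated with a scalar Adam step. The following is a minimal Python sketch, not the paper's C# prototype. It assumes a round figure of 334,000 parameters, decimal megabytes (1 MB = 10^6 bytes), standard Adam with bias correction, and round-to-nearest-even when a weight is written back to BF16. The names `training_state_bytes`, `to_f32`, `to_bf16`, and `bf16w_adam_step` are hypothetical, and the abstract does not state which rounding mode the prototype uses.

```python
import math
import struct

N_PARAMS = 334_000  # approximate parameter count quoted in the abstract


def training_state_bytes(n_params, weight_bytes, moment_bytes=4):
    """Bytes for the weights plus the two Adam moments (m and v) per parameter."""
    return n_params * (weight_bytes + 2 * moment_bytes)


def to_f32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a finite value to BF16 with round-to-nearest-even.

    BF16 keeps the top 16 bits of the FP32 encoding. NaN and infinity
    are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias; ties go to even
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16w_adam_step(w, g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight under the assumed BF16W scheme.

    The moments m and v are kept at FP32 precision; the updated weight
    is rounded to BF16 before being stored. t is the 1-based step count.
    """
    m = to_f32(beta1 * m + (1.0 - beta1) * g)
    v = to_f32(beta2 * v + (1.0 - beta2) * g * g)
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    w = to_bf16(w - lr * m_hat / (math.sqrt(v_hat) + eps))
    return w, m, v


# Memory budget: 12 bytes per parameter for FP32, 10 bytes for BF16W.
fp32_bytes = training_state_bytes(N_PARAMS, weight_bytes=4)   # 4_008_000
bf16w_bytes = training_state_bytes(N_PARAMS, weight_bytes=2)  # 3_340_000
print(f"FP32 : {fp32_bytes / 1e6:.2f} MB")
print(f"BF16W: {bf16w_bytes / 1e6:.2f} MB")

# A single illustrative update with made-up values.
w, m, v = bf16w_adam_step(w=1.0, g=0.5, m=0.0, v=0.0, t=1, lr=0.01)
print(w, m, v)
```

Under these assumptions, FP32 training state costs 12 bytes per parameter (4 for the weight and 4 for each of the two moments) and BF16W costs 10, which gives the approximately 4.0 MB and 3.34 MB quoted above and a difference of about 0.67 MB.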