Atompack: A Storage and Distribution Layer for Read-Heavy Atomistic ML Training Datasets

2026-06-29 | Machine Learning

AI summary

The authors designed Atompack, a storage format and distribution layer for the atomistic datasets used to train machine learning models. Unlike tools built for interactive curation, where editable records and ad hoc inspection matter most, Atompack is optimized for reading complete molecular records in shuffled order, which matches how training pipelines consume data. In benchmarks against HDF5, LMDB, and ASE baselines, Atompack was 96x faster than ASE LMDB on shuffled training-style reads for a representative 64-atom workload and produced artifacts about 79% smaller. This suggests that serving complete molecule records, rather than field chunks or reconstructed objects, improves shuffled training throughput while keeping files compact enough to distribute publicly.

Atomistic datasets, Machine learning training, Immutable storage, Memory-mapped files, HDF5, LMDB, Data shuffling, Append-only storage, Molecular records
Authors
Ali Ramlaoui, Daniel T. Speckhard, Sagar Pal, Fragkiskos D. Malliaros, Alexandre Duval, Victor Schmidt
Abstract
Atomistic machine learning datasets are increasingly used for training: large immutable snapshots are read repeatedly, shuffled across epochs, staged across cluster storage systems, and republished as reusable scientific artifacts. This workload differs from interactive scientific curation, where mutable records and ad hoc inspection are often more important than random indexed throughput. We present Atompack, an append-oriented storage format and distribution layer designed around a simple workload: training pipelines usually consume complete molecular records, while the order of records is randomized by the learning algorithm. Atompack appends records efficiently during dataset construction, then commits an immutable index and serves records through a memory-mapped read path optimized for training. We compare Atompack with HDF5, LMDB, and ASE baselines representing array stores, key-value records, serialized records, and object-oriented databases. The benchmarks measure sequential reads, shuffled reads, shared-filesystem behavior, write throughput, and artifact size. On a representative 64-atom workload, Atompack is 96x faster than ASE LMDB on shuffled training-style reads while producing artifacts about 79% smaller. The results indicate that serving complete molecule records, rather than field chunks or reconstructed objects, improves shuffled training throughput while keeping artifacts compact enough for public distribution.
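
The append, commit, and memory-mapped read cycle described in the abstract can be illustrated with a minimal Python sketch. This is not Atompack's API or on-disk format: the record layout, the names PackWriter, PackReader, pack_record, unpack_record, and shuffled_epoch, and the separate index file are all assumptions made for illustration. The sketch only shows the general idea of storing each molecule as one contiguous record, writing an offset index once at commit time, and then serving shuffled reads from a read-only memory map.

```python
import mmap
import os
import random
import struct
import tempfile

# Hypothetical record layout for illustration only (not Atompack's format):
#   uint32 n_atoms | n_atoms * uint8 atomic numbers | 3 * n_atoms * float64 coords


def pack_record(numbers, positions):
    """Serialize one complete molecule into a single contiguous byte string."""
    n = len(numbers)
    flat = [c for xyz in positions for c in xyz]
    if len(flat) != 3 * n:
        raise ValueError("need one (x, y, z) triple per atom")
    return struct.pack(f"<I{n}B{3 * n}d", n, *numbers, *flat)


def unpack_record(buf, start):
    """Decode the record that begins at byte offset `start` of `buf`."""
    (n,) = struct.unpack_from("<I", buf, start)
    values = struct.unpack_from(f"<{n}B{3 * n}d", buf, start + 4)
    flat = values[n:]
    positions = [tuple(flat[i:i + 3]) for i in range(0, 3 * n, 3)]
    return list(values[:n]), positions


class PackWriter:
    """Append-only writer: records go to the data file, offsets stay in memory."""

    def __init__(self, data_path):
        self._file = open(data_path, "wb")
        self._offsets = []
        self._end = 0

    def append(self, numbers, positions):
        blob = pack_record(numbers, positions)
        self._file.write(blob)
        self._offsets.append(self._end)
        self._end += len(blob)

    def commit(self, index_path):
        """Close the data file and write the offset index once; no edits after."""
        self._file.close()
        with open(index_path, "wb") as f:
            f.write(struct.pack(f"<{len(self._offsets)}Q", *self._offsets))


class PackReader:
    """Read path: one offset lookup, then one decode from a read-only mmap."""

    def __init__(self, data_path, index_path):
        with open(index_path, "rb") as f:
            raw = f.read()
        self._offsets = struct.unpack(f"<{len(raw) // 8}Q", raw)
        self._file = open(data_path, "rb")
        # Note: mmap raises ValueError for an empty file, so at least one
        # record must have been appended before reading.
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        return unpack_record(self._map, self._offsets[i])

    def close(self):
        self._map.close()
        self._file.close()


def shuffled_epoch(reader, seed):
    """Yield every record exactly once in a seeded random order."""
    order = list(range(len(reader)))
    random.Random(seed).shuffle(order)
    for i in order:
        yield reader[i]


def demo():
    """Build a two-record pack in a temp directory and read it back shuffled."""
    molecules = [
        ([8, 1, 1], [(0.0, 0.0, 0.0), (0.0, 0.75, 0.5), (0.0, -0.75, 0.5)]),
        ([6], [(1.5, 2.5, 3.5)]),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        data_path = os.path.join(tmp, "demo.pack")
        index_path = os.path.join(tmp, "demo.idx")
        writer = PackWriter(data_path)
        for numbers, positions in molecules:
            writer.append(numbers, positions)
        writer.commit(index_path)
        reader = PackReader(data_path, index_path)
        try:
            return molecules, list(shuffled_epoch(reader, seed=0))
        finally:
            reader.close()
```

Calling demo() returns the original molecules together with the same records read back in shuffled order. In this sketch, each shuffled access touches one contiguous byte range. A field-chunked array store, by contrast, may have to gather a molecule's fields from several chunks. That is one plausible reading of why complete-record layouts suit this workload, not a measured result.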