Twelve quick tips for designing AI-driven HPC workflows

2026-06-05 | Distributed, Parallel, and Cluster Computing

Distributed, Parallel, and Cluster Computing | Artificial Intelligence | Machine Learning | Software Engineering
AI summary

The author explains that HPC clusters have traditionally run deterministic, linear pipelines, whereas AI-driven scientific workflows are iterative, data-driven, and probabilistic. The guide offers twelve practical tips for making these workflows efficient, scalable, and reproducible on HPC systems, covering containerisation for environment portability, strategic use of job arrays, explicit feedback loops, and I/O optimisation for many small files. The advice aims to help researchers move from rigid execution pipelines to adaptive computational environments, and it is tailored in particular to the throughput demands of computational biology.

High-performance computing, Artificial intelligence, Workflows, Containerisation, Job arrays, I/O optimisation, Data gravity, Computational biology, Probabilistic computing, Workflow orchestration
Authors
Jamie J. Alnasir
Abstract
High-performance computing (HPC) clusters remain the backbone of large-scale scientific computation, traditionally executing deterministic, linear pipelines optimised for predictable performance. However, the pervasive integration of artificial intelligence (AI) and foundation models into scientific research has introduced a fundamentally new computational paradigm. AI-driven workflows are characteristically iterative, data-driven, and probabilistic, introducing unique challenges regarding data gravity, heterogeneous resource management, and complex workflow orchestration. This guide provides twelve practical tips designed to help researchers design efficient, scalable, and reproducible AI-driven HPC workflows. By addressing critical system-level bottlenecks - such as containerisation for environment portability, strategic deployment of job arrays, explicit feedback loop mechanics, and I/O optimisation for small files - this article offers a framework for transitioning from rigid execution pipelines to adaptive, intelligent computational environments. While these architectural principles are broadly applicable across distributed environments, they are particularly tailored to the resource-intensive throughput demands of modern computational biology.