ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks
2026-05-25 • Machine Learning
AI summary
The authors created ViroBench, the first large-scale benchmark designed specifically to test how well nucleotide foundation models understand viral genomes. They evaluated 66 models and found that performance degrades when viruses change over time or belong to a new family. They also found that sequences a model rates as statistically likely are not necessarily biologically functional, which poses a latent biosecurity risk. Their ablation studies show that taxonomic diversity in the pretraining data matters more than model size. ViroBench aims to give others a reproducible way to measure and improve viral genome analysis safely.
nucleotide sequences, viral genomics, foundation models, phylogenetic shift, biosecurity risk, latent functional validity, taxonomic diversity, pretraining data, benchmark dataset, model extrapolation
Authors
Dongxin Ye, Fang Hu, Han Hu, Shu Hu, Yang Tan, Wanli Ouyang, Stan Z. Li, Jie Cui, Nanqing Dong
Abstract
Nucleotide sequences constitute the fundamental genetic basis of biological systems, making viral genomic analysis critical for biomedical advancement. Despite progress in biological foundation models, specifically nucleotide foundation models (NFMs), the field lacks a unified standard for viral genomics that would facilitate community development and enforce biosecurity constraints. To address this, we introduce ViroBench, the first comprehensive, large-scale benchmark specifically designed for NFMs in viral settings. ViroBench evaluates models along two critical dimensions, biological understanding and latent biosecurity risk, covering 18 diverse scenarios within 4 task types. Extensive evaluation of 66 NFMs across diverse architectures yields three main conclusions. First, NFMs exhibit performance degradation in biological understanding under phylogenetic and temporal shifts, indicating weak extrapolation capabilities. Second, generation tasks reveal a decoupling between statistical likelihood and biological functional validity, posing latent biosecurity risks. Third, controlled ablation studies reveal that taxonomic diversity in pretraining data outweighs parameter scale: a lightweight baseline trained on diverse data achieves a 67.5% performance gain over its original model. Overall, ViroBench provides interpretable, diagnostic evaluations and a reproducible measurement framework for future research on viral nucleotide foundation models. The datasets and code are publicly available at https://github.com/QIANJINYDX/ViroBench.
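To make the notions of temporal and phylogenetic shift concrete, the following is a minimal sketch of how such train/test splits could be constructed. It is not the ViroBench implementation: the `Record` class, its field names, the two split functions, and the toy records are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """Hypothetical viral sequence record; field names are illustrative."""
    seq_id: str
    family: str
    collected: date


def temporal_split(records, cutoff):
    """Train on records collected before `cutoff`, test on the rest."""
    train = [r for r in records if r.collected < cutoff]
    test = [r for r in records if r.collected >= cutoff]
    return train, test


def phylogenetic_split(records, held_out_families):
    """Hold out whole families so test families never appear in training."""
    held = set(held_out_families)
    train = [r for r in records if r.family not in held]
    test = [r for r in records if r.family in held]
    return train, test


# Toy records (made up for this sketch, not taken from the benchmark).
records = [
    Record("s1", "Coronaviridae", date(2019, 12, 30)),
    Record("s2", "Coronaviridae", date(2022, 3, 1)),
    Record("s3", "Filoviridae", date(2014, 8, 15)),
    Record("s4", "Orthomyxoviridae", date(2021, 1, 10)),
]

t_train, t_test = temporal_split(records, date(2021, 1, 1))
p_train, p_test = phylogenetic_split(records, ["Filoviridae"])
```

Under a random split, sequences from the same family and period tend to land on both sides, which can hide the weak extrapolation the abstract describes; splitting by date or by family keeps the test distribution genuinely shifted.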