DNA Language Models: An Assessment of Pre-Training for Fine-Tuning Tasks
2026-06-29 • Computation and Language
AI summary
The authors studied different types of AI models used to understand DNA sequences, focusing on newer transformer models and older convolutional models. They wanted to see if the extra time and effort needed to train transformer models really leads to better results. They also looked at how a specific way of breaking down DNA sequences, called BPE tokenization, affects the model's performance. Their work aims to clarify which methods are worth using for tasks involving DNA data.
foundation models, Large Language Models (LLMs), transformer models, convolutional models, pretraining, fine-tuning, Byte Pair Encoding (BPE), DNA sequence representation, genomics tasks, benchmark evaluation
Authors
Romain Karpinsky, Julien Mozziconacci, Mickaël Delcey
Abstract
Recent breakthroughs in foundation models and Large Language Models (LLMs) have opened new opportunities for studying and decoding genomic sequences. Several state-of-the-art approaches, such as DNABERT2, rely on transformer-based architectures, while others, such as ConvNova, still build on more conventional convolutional models. However, systematic benchmark comparisons across these methods remain scarce. Because transformer-based models require extensive and costly pretraining, it is crucial to evaluate whether their performance gains justify this overhead. Moreover, LLMs such as DNABERT2 typically rely on Byte Pair Encoding (BPE) tokenization, whose relevance for DNA sequence representation is still debated within the genomics community. In this work, we investigate three key questions: (i) do heavily pretrained transformer-based models improve performance on fine-tuning tasks enough to justify that pretraining, (ii) what is the actual contribution of pretraining in this setting, and (iii) how does BPE tokenization affect performance on genomics-related tasks?
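To make the BPE question concrete, the following is a minimal sketch of how byte pair encoding builds variable-length tokens from nucleotides. It is a toy illustration only, not the tokenizer used by DNABERT2 or by the authors: the function names (`train_bpe`, `apply_merge`, `tokenize`) and the example sequences are hypothetical, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequences, num_merges):
    """Learn up to `num_merges` merge rules from DNA strings (toy version)."""
    corpus = [list(seq) for seq in sequences]  # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))  # adjacent token pairs
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(seq, merges):
    """Tokenize a DNA string by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    # Hypothetical toy corpus; real tokenizers are trained on whole genomes.
    merges = train_bpe(["ATATATGC", "ATATCG"], num_merges=2)
    print(merges)
    print(tokenize("ATATGC", merges))
```

Unlike fixed-length k-mers, the resulting tokens vary in length and depend on the training corpus, which is one reason the suitability of BPE for DNA remains an open question.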