DNA Language Models: An Assessment of Pre-Training for Fine-Tuning Tasks

2026-06-29 • Computation and Language

AI summary

The authors studied different types of AI models used to understand DNA sequences, focusing on newer transformer models and older convolutional models. They wanted to see if the extra time and effort needed to train transformer models really leads to better results. They also looked at how a specific way of breaking down DNA sequences, called BPE tokenization, affects the model's performance. Their work aims to clarify which methods are worth using for tasks involving DNA data.

foundation models, Large Language Models (LLMs), transformer models, convolutional models, pretraining, fine-tuning, Byte Pair Encoding (BPE), DNA sequence representation, genomics tasks, benchmark evaluation

Authors

Romain Karpinsky, Julien Mozziconacci, Mickaël Delcey

Abstract

Recent breakthroughs in foundation models and Large Language Models (LLMs) have introduced new opportunities for studying and decoding genomic sequences. Several state-of-the-art approaches, such as DNABERT2, rely on transformer-based architectures, while others, such as ConvNova, still build upon more conventional convolutional models. However, systematic benchmark comparisons across these methods remain scarce. Given that transformer-based models require extensive and costly pretraining, it is crucial to evaluate whether their performance gains justify this overhead. Moreover, LLMs such as DNABERT2 typically rely on Byte Pair Encoding (BPE) tokenization, whose relevance for DNA sequence representation is still debated within the genomics community. In this work, we investigate three key questions: (i) do heavily pretrained transformer-based models improve performance on fine-tuning tasks enough to justify their pretraining cost, (ii) what is the actual contribution of pretraining in this setting, and (iii) how does BPE tokenization impact performance on genomics-related tasks?
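
For readers unfamiliar with BPE, the following is a minimal sketch of how merges could be learned on DNA, assuming a toy corpus and a plain greedy merge rule. It is not the tokenizer used by DNABERT2 or by the authors; the function names `learn_bpe_merges`, `apply_merge`, and `tokenize` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each left-to-right occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` BPE merges, starting from single nucleotides."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent adjacent pair; ties are broken lexicographically
        # (an assumption made here only to keep the sketch deterministic).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(seq, merges):
    """Tokenize a DNA string by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy corpus (made up for illustration, not data from the paper).
toy_corpus = ["TATATATA", "GGTATA"]
toy_merges = learn_bpe_merges(toy_corpus, num_merges=2)
toy_tokens = tokenize("GTATAT", toy_merges)
```

Unlike fixed-length k-mers, the resulting tokens have variable length and depend on corpus frequencies, which is one reason the suitability of BPE for DNA remains open to debate.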
