KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts
2026-06-01 • Computation and Language
Computation and Language, Artificial Intelligence
AI summary
The authors created KliniskVestBERT, a set of three language models designed to understand Norwegian clinical text. They took existing Norwegian and general-purpose language models and continued pretraining them on real, de-identified healthcare documents covering a wide range of clinical notes, such as discharge summaries and surgical reports. When tested on both synthetic benchmarks and real-world clinical tasks, each specialized model outperformed its original baseline, indicating that training on healthcare-specific language helps. The work was a collaborative effort among several Norwegian healthcare organizations.
Natural Language Processing (NLP), BERT, clinical language models, pretraining, Norwegian language, clinical texts, domain-specific training, discharge summaries, healthcare NLP, language model evaluation
Authors
Christian Autenried, Cosimo Persia
Abstract
The increasing application of Natural Language Processing (NLP) in healthcare demands language models specifically attuned to the complexities of clinical language. This work introduces KliniskVestBERT, a suite of three BERT-based encoder models pre-trained on a substantial corpus of real-world, de-identified Norwegian clinical texts from Helse Vest. We continue pre-training the existing language models Nb-BERT-large, NorBERT3-large, and ModernBERT on our specialized clinical dataset, which is drawn from a representative population of Helse Vest patients. The included document types are curated to cover a broad clinical spectrum in both Bokmål and Nynorsk, including discharge summaries, surgical reports, and nursing notes, among others, so that the corpus represents the range of language used in Norwegian healthcare settings. Evaluation on three synthetic Norwegian clinical benchmark datasets and two real-world problems demonstrates that each of our clinically specialized models consistently outperforms its baseline counterpart, highlighting the significant benefit of domain-specific pre-training for NLP tasks in the clinical domain. The project was a joint effort by all Helse Vest entities (Helse Bergen, Helse Fonna, Helse Førde, and Helse Stavanger) together with DIPS, under the project leadership of Helse Vest ICT.