SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

2026-06-11 | Computation and Language

Computation and Language, Artificial Intelligence, Machine Learning
AI summary

The authors created SkMTEB, a new benchmark to test how well text embedding models work for Slovak, a language with few resources. They tested 31 models and found that large, instruction-tuned multilingual models do best, while Slovak-specific models trained for understanding tasks don’t transfer well to embedding tasks. To help with practical use, they developed smaller Slovak embedding models by trimming the vocabulary of larger multilingual models and fine-tuning them, reaching performance competitive with proprietary APIs while staying small enough to run locally. They released all their work openly to help others improve language technology for low-resource languages.

text embeddings, Slovak language, low-resource language, benchmark, instruction tuning, fine-tuning, multilingual models, semantic search, retrieval-augmented generation, model compression
Authors
Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková, Viktória Ondrejová
Abstract
We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types, nearly 4× the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally deployable Slovak embeddings, we develop e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62%, our open-source models achieve performance competitive with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.
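The abstract names vocabulary trimming but does not describe the procedure. As an illustration only, the general idea can be sketched in plain Python: keep the embedding rows for tokens that occur in target-language text (plus special tokens) and remap their ids. The function name `trim_vocabulary`, the toy vocabulary, and the 2-dimensional embeddings below are all hypothetical and are not taken from the authors' released code.

```python
from typing import Dict, List, Set, Tuple


def trim_vocabulary(
    vocab: Dict[str, int],
    embeddings: List[List[float]],
    keep_tokens: Set[str],
    special_tokens: Tuple[str, ...] = ("<pad>", "<unk>"),
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Keep only the embedding rows for special tokens and tokens in keep_tokens."""
    # Walk tokens in original id order so relative ordering is preserved.
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    new_vocab: Dict[str, int] = {}
    new_embeddings: List[List[float]] = []
    for token, old_id in ordered:
        if token in special_tokens or token in keep_tokens:
            # The new id is the position of the row in the trimmed matrix.
            new_vocab[token] = len(new_embeddings)
            new_embeddings.append(list(embeddings[old_id]))
    return new_vocab, new_embeddings


# Toy multilingual vocabulary (hypothetical) with one 2-d embedding per token.
vocab = {"<pad>": 0, "<unk>": 1, "dobrý": 2, "hello": 3, "deň": 4, "bonjour": 5}
embeddings = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7], [0.8, 0.9]]

# Tokens observed when tokenizing a (hypothetical) Slovak corpus.
slovak_tokens = {"dobrý", "deň"}

small_vocab, small_embeddings = trim_vocabulary(vocab, embeddings, slovak_tokens)
reduction = 1 - len(small_embeddings) / len(embeddings)
print(small_vocab)
print(f"embedding rows: {len(embeddings)} -> {len(small_embeddings)} ({reduction:.0%} fewer)")
```

In a multilingual model the token embedding matrix typically accounts for a large share of the parameters, so dropping rows for tokens that never occur in Slovak text shrinks the model without touching the transformer layers. A real pipeline would also need to rebuild the tokenizer so that it emits the remapped ids.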