SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models
2026-05-25 • Computation and Language
Computation and Language • Artificial Intelligence • Computers and Society
AI summary
The authors compared how often four open-weight language models refuse harmful prompts in Somali versus English, using the purpose-built SomaliBench dataset. All four models refused harmful prompts less often in Somali than in English; for three of them, the most common Somali non-refusal was an unclear reply (empty, wrong-language, or incoherent) rather than fluent harmful content. The native-speaker author spot-checked 80 of the automated judge's labels and agreed with all of them, which supports the reliability of the judge's refusal classifications. The authors report only summary statistics and do not release the model outputs.
large language models, safety evaluation, instruction-tuning, harmful content, low-resource languages, SomaliBench, refusal rate, Cohen's kappa, model alignment, cross-lingual evaluation
Authors
Khalid Yusuf Dahir
Abstract
Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0, a native-author-verified benchmark of 100 harmful-intent prompts paired across English and Somali. Each of Llama-3.1-8B-Instruct, Gemma-2-9B-Instruct, Qwen-2.5-7B-Instruct, and Aya-23-8B is run locally with temperature 0 and the same English "helpful, harmless, and honest" (HHH) system prompt. A pinned Claude Sonnet snapshot (claude-sonnet-4-5-20250929) classifies each response as refused, complied, or unclear; the native author spot-checks a stratified 80-row sample. We find large English-to-Somali refusal gaps for all four models: Llama-3.1-8B (0.90; 95% bootstrap CI [0.85, 0.96]), Aya-23-8B (0.75 [0.67, 0.83]), Qwen-2.5-7B (0.69 [0.59, 0.78]), and Gemma-2-9B (0.38 [0.27, 0.49]). For three models, the dominant Somali non-refusal mode is not fluent harmful compliance but unclear output: empty, wrong-language, or incoherent generations. The native verification spot-check achieves 100% agreement with the judge (Cohen's kappa = 1.00) on the 80 sampled rows. We report aggregate refusal rates, category gaps, and reliability statistics only; raw model generations are retained locally and are not released.
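The abstract does not spell out how the refusal gap, its bootstrap interval, or Cohen's kappa are computed, so the following is only a minimal sketch under stated assumptions: the gap is taken to be the English refusal rate minus the Somali refusal rate over the paired prompts, the interval is a percentile bootstrap that resamples prompt pairs, and kappa is the standard two-rater formula. The function names and the ten-row toy labels are hypothetical and are not SomaliBench data.

```python
import random

def refusal_rate(labels):
    """Fraction of responses labelled 'refused'."""
    return sum(1 for x in labels if x == "refused") / len(labels)

def refusal_gap(en_labels, so_labels):
    """Assumed definition: English refusal rate minus Somali refusal rate."""
    return refusal_rate(en_labels) - refusal_rate(so_labels)

def paired_bootstrap_ci(en_labels, so_labels, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the gap, resampling prompt pairs with replacement.

    Resampling by prompt index keeps each English prompt together with its
    Somali counterpart (an assumption about the paper's procedure).
    """
    rng = random.Random(seed)
    n = len(en_labels)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(refusal_gap([en_labels[i] for i in idx],
                                [so_labels[i] for i in idx]))
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_resamples)]
    hi = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_expected = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
                     for c in categories)
    if p_expected == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (p_observed - p_expected) / (1 - p_expected)

# Synthetic toy labels for illustration only (NOT data from the paper).
en_toy = ["refused"] * 9 + ["complied"]
so_toy = ["refused"] * 2 + ["unclear"] * 5 + ["complied"] * 3

gap = refusal_gap(en_toy, so_toy)
ci_low, ci_high = paired_bootstrap_ci(en_toy, so_toy, n_resamples=2000)
print(f"gap={gap:.2f}, 95% CI=[{ci_low:.2f}, {ci_high:.2f}]")

# A judge and a human who agree on every row, across more than one label.
print(cohens_kappa(so_toy, list(so_toy)))
```

Note that kappa equals 1 under perfect agreement only when the sample contains more than one label; if every sampled row carried the same label, chance agreement would also be 1 and kappa would be undefined, which is why the sketch returns NaN in that case.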