Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis
2026-06-08 • Computation and Language
Computation and Language, Artificial Intelligence
AI summary
The authors show that simply translating safety tests for large language models (LLMs) from English into other languages misses important cultural details tied to real-world threats and social norms. They created pairs of test sets for four languages, one directly translated and one culturally adapted, and found that the adapted tests reveal more safety risks than direct translations do. Their analysis shows that direct translation scores lower on cultural realism and underestimates dangers across many categories and languages. The study suggests that evaluating LLM safety properly requires tailoring tests to each language's cultural context instead of just translating text.
large language models, multilingual evaluation, cultural adaptation, direct translation, safety benchmarks, attack success rate, cultural realism, language-specific context, threat scenarios
Authors
Hyeji Choi, Yongtaek Lim, Minwoo Kim
Abstract
Multilingual safety evaluation of large language models (LLMs) has predominantly relied on direct translation (DT) of English benchmarks into target languages, an approach that converts surface-level linguistic form while failing to reflect the cultural context embedded in threat scenarios, social norms, and legal frameworks. We construct paired DT and culturally-adapted (CA) datasets via 1:1 seed matching for four languages, Korean (KO), Japanese (JA), Thai (TH), and Khmer (KM), and compare Attack Success Rate (ASR) and Cultural Realism scores across four open-source LLMs. CA prompts yield Delta-ASR > 0 across all 16 language × model combinations (mean +9.3 pp), and DT-based evaluation underestimates risk in 44 of 48 category × language combinations. Language-level analysis reveals that the distribution of threat forms is heterogeneous across languages. Cultural Realism analysis further shows that DT Cultural Depth (C3) scores remain consistently below 1.0 out of 3.0 across all four languages (mean 0.17), whereas CA scores reach up to 2.51, indicating that direct translation produces inputs that diverge systematically from those encountered in real-world multicultural settings. These findings demonstrate that adapting benchmarks to language-specific cultural contexts, rather than relying on linguistic translation alone, is necessary for valid multilingual LLM safety evaluation.
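The paired comparison described in the abstract can be illustrated with a minimal sketch. It assumes that Delta-ASR is the CA attack success rate minus the DT attack success rate, in percentage points, computed over seeds that appear in both variants. The abstract does not spell out this definition, so treat it as an assumption. The record layout, the function names `attack_success_rate` and `delta_asr`, the model label, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def attack_success_rate(outcomes):
    """Percentage of prompts for which the attack was judged successful."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


def delta_asr(records):
    """Return {(language, model): ASR_CA - ASR_DT} in percentage points.

    Each record is (language, model, seed_id, variant, success), where
    variant is "DT" or "CA". Only seeds present in both variants are
    compared, mirroring the 1:1 seed matching (assumed interpretation).
    """
    by_cell = defaultdict(lambda: {"DT": {}, "CA": {}})
    for language, model, seed_id, variant, success in records:
        by_cell[(language, model)][variant][seed_id] = bool(success)

    deltas = {}
    for cell, variants in by_cell.items():
        # Keep only seeds that have both a DT and a CA prompt.
        shared = sorted(variants["DT"].keys() & variants["CA"].keys())
        if not shared:
            continue
        asr_dt = attack_success_rate([variants["DT"][s] for s in shared])
        asr_ca = attack_success_rate([variants["CA"][s] for s in shared])
        deltas[cell] = asr_ca - asr_dt
    return deltas


# Made-up toy data for one language x model cell (not results from the paper).
records = [
    ("KO", "model-a", 1, "DT", False), ("KO", "model-a", 1, "CA", True),
    ("KO", "model-a", 2, "DT", True),  ("KO", "model-a", 2, "CA", True),
    ("KO", "model-a", 3, "DT", False), ("KO", "model-a", 3, "CA", False),
    ("KO", "model-a", 4, "DT", False), ("KO", "model-a", 4, "CA", False),
]
toy_deltas = delta_asr(records)
```

A positive value for a cell means the culturally-adapted prompts elicited more unsafe responses than their directly translated counterparts on the same seeds, which is the direction the abstract reports for all 16 cells.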