Beyond Literal Translation: Evaluating Cultural Effectiveness in Social Media UGC
2026-05-25 • Computation and Language
AI summary
The authors introduce CULTURE-MT, a benchmark for social media translation that checks whether translations preserve the original post's cultural meaning and emotional resonance. It contains 1,002 user-generated posts across 14 domains, grouped into four types by culture-loaded symbols and linguistic style features. They fine-tune Qwen3-8B and Qwen3-32B as baselines and evaluate 15 models in total, baselines included, finding that traditional translation metrics fail to capture cultural effectiveness. Among base LLMs, cultural effectiveness correlates with model size. The benchmark is released with an online leaderboard where submitted translations are scored by the authors' trained JUDGER.
social media translation, user-generated content (UGC), large language models (LLMs), cultural transmission, emotion resonance, benchmark dataset, evaluation metrics, Qwen3 models, translation quality, cultural adaptability
Authors
Linjuan Wu, Ruiqi Zhang, Xinze Lyu, Ye Guo, Daoxin Zhang, Zhe Xu, Yao Hu, Yixin Cao, Yongliang Shen, Weiming Lu
Abstract
Social media platforms enable large-scale cross-lingual communication, but translating user-generated content (UGC) remains challenging due to its informal style, cultural references, and interaction-based expressions. While recent LLMs have improved translation quality, existing benchmarks and metrics often fail to capture whether translations convey intended meaning and cultural resonance in real-world settings. In this work, we introduce CULTURE-MT, a benchmark for social media translation that focuses on both CULtural Transmission and UGC-specific emotion REsonance. CULTURE-MT consists of 1,002 UGC notes across 14 domains, categorized into four types based on culture-loaded symbols and linguistic style features. We also construct UGC-oriented training data to fine-tune Qwen3-8B and Qwen3-32B as baselines. We propose cultural effectiveness as a new evaluation criterion, focusing on expression accuracy and cultural adaptability. Testing 15 models, including the baselines, we find that traditional metrics fail to capture cultural effectiveness. We also observe that cultural effectiveness on base LLMs correlates with model size. Our work provides a comprehensive evaluation system for UGC translation models and will offer an open evaluation platform to advance research in this area. We release the CULTURE-MT benchmark and provide an online leaderboard where submitted translation results can be evaluated by our trained JUDGER.