CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

2026-06-01

Computation and Language
AI summary

The authors note that previous studies mostly checked if large language models (LLMs) know about different cultures but not if they can use that knowledge well in real situations. They created CultureForest, a test with thousands of questions based on cultural norms from many countries, to better see how well models apply cultural knowledge. Their experiments show that although models have cultural facts, they struggle with open-ended questions and often give cautious answers, especially when cultural rules are strict. The authors suggest we should focus more on testing how models reason with cultural knowledge, not just what they know.

Cultural intelligence, Large language models, Cultural norms, Benchmark, Open-ended reasoning, Knowledge grounding, Cross-cultural evaluation, Model bias, Reasoning ability
Authors
Yangfan Ye, Xiaocheng Feng, Jialong Tang, Xiayu Cao, Zihan Zhang, Xiachong Feng, Baosong Yang, Bing Qin
Abstract
Existing research largely reduces cultural intelligence in LLMs to a knowledge-level problem, overlooking whether models can effectively utilize their acquired knowledge in realistic scenarios. To bridge this gap, we introduce CultureForest, a benchmark for Cultural Norm Grounded Reasoning. Each question is grounded in a small set of atomic norms, enabling verifiable and attributable evaluation. CultureForest comprises 5,378 examples across 8 domains and 53 countries/regions, and supports a progressive evaluation from multiple-choice to open-ended generation. Extensive experiments reveal that even top-tier models degrade substantially in open-ended settings, accompanied by pronounced cross-region disparities. Through targeted analysis, we uncover several consistent patterns: (1) test-time reasoning yields limited gains and may exacerbate inequity; (2) models exhibit highly shared regional preference structures; (3) model responses are markedly conservative, especially under stricter cultural constraints; and (4) by disentangling cultural knowledge acquisition from cultural reasoning, we show that while LLMs possess substantial cultural knowledge, their performance is further bottlenecked by its effective use. These findings point to a necessary shift from knowledge-centric evaluation toward measuring knowledge-grounded reasoning.