Beyond Averages: Evaluating LLMs on Human Survey Replication at the Distributional Level
2026-06-08 • Computation and Language
AI summary
The authors studied how well large language models (LLMs) can mimic human responses in a survey about Korean instant noodle purchases. They found that while LLMs do a decent job matching average human choices, they struggle to match the full range and variation of human answers, especially for how many noodles people buy. This means just checking average responses can give a misleading idea of how realistic the model's answers are. The authors also found that adding structured information about the simulated person and using images can help, but asking the model to explain its choices makes results worse.
Large Language Models, Survey Response Simulation, Distributional Alignment, Consumer Choice Experiment, Binary Response, Categorical Choice, Count Data, Mean-level Evaluation, Prompt Engineering, Multimodal Input
Authors
Jeonghyeon Moon, Jiwon Kim, Yeheum Lah, Yoonju Han, Yuncheol Kang
Abstract
LLMs are increasingly used to simulate human survey responses, but prior work has mainly evaluated replication using mean-level or aggregate agreement, offering limited insight into whether LLMs reproduce the variability of human behavior. We evaluate LLM-based survey replication at the distributional level using a non-public 2010 consumer choice experiment on Korean instant noodle purchases, a setting unlikely to overlap with model training data. We evaluate three response variables of differing statistical type: binary purchase incidence, categorical brand choice, and count purchase quantity. For each, we compare human and LLM responses in terms of mean-level, pattern-level, and distributional alignment, and benchmark them against reference baselines derived from the human data alone. LLMs reproduce condition-level patterns reasonably well but fail to capture distributional structure: for purchase quantity, no model beats a condition-insensitive baseline that simply matches the pooled human distribution. Because models that match human means well can still produce distributions further from humans than this baseline, mean-based evaluation alone can be actively misleading. Replication also varies with input configuration, with structured personas and multimodal inputs improving alignment while explicit reasoning prompting degrades it monotonically.
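The gap between mean-level and distributional agreement can be illustrated with a small sketch. It is not the authors' evaluation code: the abstract does not name its distance metric, so total variation distance is used here as an assumed stand-in, and the condition names and purchase quantities are hypothetical toy data. The simulated responses match the human mean in every condition, yet sit further from the human distribution than a condition-insensitive baseline that reuses the pooled human distribution everywhere.

```python
from collections import Counter


def pmf(samples, support):
    """Empirical probability mass function over a fixed support."""
    counts = Counter(samples)
    return [counts[k] / len(samples) for k in support]


def total_variation(p, q):
    """Total variation distance between two PMFs on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mean(samples):
    return sum(samples) / len(samples)


# Hypothetical toy data: purchase quantities under two experimental conditions.
human = {
    "cond_a": [0, 0, 0, 1, 1, 2, 3, 5],
    "cond_b": [0, 1, 1, 2, 2, 3, 3, 4],
}
# Hypothetical simulated responses: same means as the humans, far less spread.
llm = {
    "cond_a": [1, 1, 1, 1, 2, 2, 2, 2],
    "cond_b": [2, 2, 2, 2, 2, 2, 2, 2],
}

support = range(0, 6)

# Condition-insensitive baseline: the pooled human distribution,
# used as the prediction for every condition.
pooled = [q for qs in human.values() for q in qs]
baseline_pmf = pmf(pooled, support)

results = {}
for cond in human:
    human_pmf = pmf(human[cond], support)
    results[cond] = {
        "mean_gap": abs(mean(human[cond]) - mean(llm[cond])),
        "tv_llm": total_variation(human_pmf, pmf(llm[cond], support)),
        "tv_baseline": total_variation(human_pmf, baseline_pmf),
    }

for cond, r in results.items():
    print(cond, r)
```

In this toy setup the mean gap is zero in both conditions while the simulated responses have a larger distance to the human data than the pooled baseline, which is the pattern the abstract reports for purchase quantity. For count data, a distance that respects ordering (such as the Wasserstein distance) may be a more natural choice than total variation; that too is an assumption rather than something the abstract specifies.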