A comparison of human and LLM-simulated participants in a writing style task

2026-06-15 · Human-Computer Interaction

AI summary

The authors tested whether a large language model (LLM), specifically GPT-4o, can stand in for human participants in experiments. They compared the findings of a GPT-4o simulation with those from 30 human participants in a task that evaluated how well writing style preference inference algorithms adapted to a participant over repeated interactions, relative to a baseline. They found hints of bias and a lack of depth in GPT-4o's text generation and judgement that kept it from accurately simulating human behavior. The results also hint at human biases, which underscores the importance of human factors when evaluating systems that depend on human-automation interaction. Rather than treating the discrepancies as evidence for or against LLM-simulated participants, the authors present the study as a case analysis of methodological and design challenges.

Large Language Models, GPT-4o, Simulation, Human-Computer Interaction, Writing Style Preference, Bias, Synthetic Participants, Human-Automation Interaction, Evaluation Methods, Behavioral Modeling
Authors
Felix Gröner, Erin K. Chiou
Abstract
Because large language models (LLMs) can produce natural language that is sometimes indistinguishable from texts produced by people, some researchers are starting to consider replacing human participants with LLM simulations. In this study, we test the extent to which the findings of a simulation with an LLM prompted to act as a synthetic participant match those obtained from 30 human participants. In our experiments, we evaluated how well writing style preference inference algorithms adapted to a participant over repeated interactions, compared to a baseline. We discover hints of bias and a lack of depth in GPT-4o's text generation and judgement that prevent it from accurately simulating people's behavior. Our results also hint at human biases that highlight the importance of considering human factors in the evaluation of systems that depend on human-automation interaction. Rather than treating these discrepancies as evidence for or against the validity of LLM-simulated participants, we present this study as a case analysis of methodological and design challenges.