In-Context Learning for the Imputation of Public Opinion Data with Large Language Models

2026-06-08 | Computation and Language

AI summary

The authors studied how to fill in missing survey answers using large language models through in-context learning (ICL), in which the model is shown example responses in its prompt rather than being retrained. They tested the approach on real survey data with answers removed under three missingness mechanisms (MCAR, MAR, MNAR). Their ICL method had lower absolute error than established statistical methods such as MICE PMM, with the largest gains when the missingness was not random (MNAR). The authors also released a Python package to make the technique easier to use with different language models.

Large Language Models, In-Context Learning, Survey Data Imputation, Missing Data Mechanisms, MCAR, MAR, MNAR, MICE PMM, Confidence Intervals, Python Package
Authors
Tobias Holtdirk, Georg Ahnert, Joseph W Sakshaug, Anna-Carolina Haensch
Abstract
Large language models have been widely evaluated as simulators of individual survey responses. In practice, however, fully unobserved responses are rare; the dominant problem is partial non-response. Imputation aims to restore the overall structure of a survey dataset by filling in these missing values. It has its own well-defined evaluation criteria and differs fundamentally from prediction. We propose to impute missing survey data through in-context learning (ICL). We systematically evaluate ICL design choices across different missingness mechanisms (MCAR, MAR, MNAR) on 150 opinion variables spanning 15 waves of the American Trends Panel. Compared to well-established statistical methods for data imputation like MICE PMM, our ICL approach consistently reduces absolute error across all missingness mechanisms, with the largest gains under non-random missingness (MNAR). Notably, the best-performing specification (gpt-oss-120b with 100 in-context examples) achieves near-nominal aggregate coverage (approaching the 95% level) with confidence intervals two to five times narrower than MICE PMM. We publish a Python package with an sklearn-like API to enable easy deployment of our method using local and proprietary LLMs.
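
The three missingness mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' simulation procedure or their package's API: the function name `make_missing_mask`, the logistic weighting, and the single-covariate setup are assumptions made only for illustration. Under MCAR every value is equally likely to be removed; under MAR the removal probability depends on an observed covariate; under MNAR it depends on the (to-be-removed) value itself.

```python
import numpy as np


def make_missing_mask(target, covariate, mechanism, rate=0.3, seed=0):
    """Return a boolean mask (True = value is removed) for one survey item.

    Hypothetical helper for illustration only; not part of any published package.
    """
    rng = np.random.default_rng(seed)
    target = np.asarray(target, dtype=float)
    covariate = np.asarray(covariate, dtype=float)

    if mechanism == "MCAR":
        # Missing completely at random: same probability for every respondent.
        prob = np.full(target.shape, rate)
    elif mechanism in ("MAR", "MNAR"):
        # MAR: probability driven by an observed covariate.
        # MNAR: probability driven by the value that will go missing.
        driver = covariate if mechanism == "MAR" else target
        z = (driver - driver.mean()) / (driver.std() + 1e-12)
        weights = 1.0 / (1.0 + np.exp(-z))  # assumed logistic link
        # Rescale so the expected share of missing values is about `rate`.
        prob = np.clip(rate * weights / weights.mean(), 0.0, 1.0)
    else:
        raise ValueError(f"unknown mechanism: {mechanism!r}")

    return rng.random(target.shape) < prob


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    answers = rng.integers(1, 6, size=1000)  # synthetic 5-point opinion item
    age = rng.normal(50, 15, size=1000)      # synthetic observed covariate
    for mech in ("MCAR", "MAR", "MNAR"):
        mask = make_missing_mask(answers, age, mech)
        print(mech, "share missing:", round(float(mask.mean()), 3))
```

In this sketch, an imputation method would be applied to the values where the mask is True and then compared against the held-out originals. MNAR is the hardest case because the observed answers are no longer representative of the removed ones.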