Social Policy of Large Language Models: How GPT, Claude, DeepSeek and Grok Allocate Social Budgets in Spain and Germany

2026-05-11

Computers and Society
AI summary

The authors tested four popular large language models (Claude, GPT-4o, DeepSeek, and Grok) to see how each would allocate a fixed national social budget across twelve public spending areas in Spain and Germany. They found that all models tended to underfund pensions and overfund housing and employment relative to real European budgets. The models differed chiefly in how much they concentrated or spread the budget, not in country-related biases. Only one model, Claude, adjusted its allocations noticeably depending on the country context. The authors suggest that while these models can help generate budgeting ideas, they should not replace expert judgment in public finance decisions.

Large Language Models, Public Budget Allocation, OECD Reference Budgets, Kruskal-Wallis Test, Mann-Whitney U Test, Pearson Correlation, Geopolitical Bias, Social Policy, National Context Sensitivity, Expert Deliberation
Authors
Claudia Benavides Cantos, Eduardo C. Garrido-Merchán
Abstract
We study how four widely used large language models, namely Claude, GPT-4o, DeepSeek and Grok, distribute a fixed national social budget across twelve macro-areas of public expenditure under two European national contexts, Spain and Germany. Each combination of model and country is queried six times under identical prompts and generation parameters, producing forty-eight independent allocations that are compared against approximate Organisation for Economic Co-operation and Development (OECD) reference budgets and against each other. We formalise five hypotheses regarding geopolitical bias, housing under-allocation, structural convergence, sensitivity to national context, and under-representation of politically sensitive categories. The differences between models are then validated through Kruskal-Wallis tests on each macro-area, with post-hoc Mann-Whitney U comparisons under Bonferroni correction, and complemented by an analysis of pairwise Pearson correlations and a lexical examination of the textual justifications produced by each model. The results show that all four models share a systematic implicit social policy that diverges from real European spending structures: pensions are under-allocated by a factor close to three, while housing and employment are over-allocated by factors of four and two respectively. The principal axis of differentiation between models is not geopolitical, since Claude and DeepSeek are the most correlated pair across both countries, but rather a contrast between concentration and dispersion of the budget. Only Claude exhibits substantive sensitivity to the national context. The conclusions delimit the conditions under which language models may responsibly support, but not replace, expert deliberation in public budgeting.
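The statistical pipeline described in the abstract (per-area Kruskal-Wallis tests, Bonferroni-corrected Mann-Whitney U post-hocs, and pairwise Pearson correlations between models) can be sketched as follows. This is a minimal illustration, not the authors' code: the allocation numbers are synthetic placeholders drawn from a Dirichlet distribution, and the six-run, twelve-area, four-model shapes simply mirror the design described above.

```python
# Hypothetical sketch of the paper's statistical validation, using scipy.
# Synthetic data only: real allocations would come from the model queries.
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
models = ["Claude", "GPT-4o", "DeepSeek", "Grok"]
n_runs, n_areas = 6, 12

# Synthetic allocations: one (runs x areas) matrix per model, in % of budget.
alloc = {m: rng.dirichlet(np.ones(n_areas) * 5, size=n_runs) * 100
         for m in models}

# 1) Kruskal-Wallis test on each macro-area across the four models.
kw_p = [stats.kruskal(*(alloc[m][:, j] for m in models)).pvalue
        for j in range(n_areas)]

# 2) Post-hoc pairwise Mann-Whitney U with Bonferroni correction
#    (shown here for a single example macro-area, column 0).
pairs = list(itertools.combinations(models, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted significance threshold
posthoc = {(a, b): stats.mannwhitneyu(alloc[a][:, 0], alloc[b][:, 0]).pvalue
           for a, b in pairs}

# 3) Pairwise Pearson correlation between mean allocation profiles,
#    i.e. each model's average % vector over the twelve areas.
means = {m: alloc[m].mean(axis=0) for m in models}
corr = {(a, b): stats.pearsonr(means[a], means[b])[0] for a, b in pairs}
```

Under this design, a high Pearson correlation between two models' mean profiles (as the paper reports for Claude and DeepSeek) indicates structural similarity in how the budget is shaped, independently of which individual areas reach significance in the post-hoc tests.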