How reliable are LLMs when it comes to playing dice?

2026-06-05 · Computation and Language

Computation and Language, Artificial Intelligence, Human-Computer Interaction
AI summary

The authors tested how well large language models (LLMs) handle probability using two types of problems: standard exercises and counterintuitive ones that often mislead people. The models did very well on the standard problems but struggled on the counterintuitive ones. Wording also mattered: when the familiar phrasing of a problem was disguised or misleading hints were added, performance dropped substantially. Overall, the authors conclude that, despite their mathematical skills, current LLMs are not yet genuine probabilistic reasoners.

large language models, probabilistic reasoning, discrete probability, heuristic reasoning, Chain-of-Thought prompting, benchmarking, token bias, prompt engineering, model accuracy
Authors
Luca Avena, Gianmarco Bet, Bernardo Busoni
Abstract
We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We construct two datasets: a set of standard exercises and a set of counterintuitive exercises designed to trigger heuristic reasoning. We evaluate 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, with no model proving immune. Taken together, these findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success on advanced mathematical problems.
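To make the notion of a counterintuitive discrete probability exercise concrete, the sketch below checks one classic dice question by exact enumeration. The question itself is an illustrative assumption and is not taken from the authors' datasets; the helper name `conditional_probability` is likewise hypothetical. Given that at least one of two fair dice shows a six, a common heuristic answer for the probability that both show a six is 1/6, whereas enumeration gives 1/11.

```python
from fractions import Fraction
from itertools import product


def conditional_probability(event, given, faces=6, dice=2):
    """Exact P(event | given) for fair dice, by enumerating all outcomes.

    Hypothetical helper for illustration only; not the paper's method.
    """
    outcomes = list(product(range(1, faces + 1), repeat=dice))
    conditioned = [o for o in outcomes if given(o)]
    if not conditioned:
        raise ValueError("conditioning event has probability zero")
    favorable = [o for o in conditioned if event(o)]
    return Fraction(len(favorable), len(conditioned))


def both_six(outcome):
    return all(d == 6 for d in outcome)


def at_least_one_six(outcome):
    return 6 in outcome


def first_is_six(outcome):
    return outcome[0] == 6


# Counterintuitive phrasing: "at least one die shows a six".
p_at_least_one = conditional_probability(both_six, at_least_one_six)

# Intuitive phrasing: "the first die shows a six".
p_first = conditional_probability(both_six, first_is_six)

print(p_at_least_one)  # 1/11
print(p_first)         # 1/6
```

The two conditioning events sound similar but restrict the sample space differently (11 outcomes versus 6), which is the kind of gap between heuristic and exact reasoning that counterintuitive exercises are meant to expose.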