The Metacognitive Probe: Five Behavioural Calibration Diagnostics for LLMs

2026-05-11

Artificial Intelligence · Computation and Language · Machine Learning
AI summary

The authors created the Metacognitive Probe, a five-task diagnostic designed to measure how well large language models (LLMs) judge their own confidence along several distinct behavioural dimensions. They tested it on eight frontier models and 69 humans to see how well models know when they are likely to be right or wrong. Traditional benchmarks only check whether a model's answers are correct, but this tool reveals where a model might be confidently wrong. For example, they found that one model, Gemini 2.5 Flash, scored very well at judging confidence within a task but poorly at predicting difficulty across tasks. This shows the tool can uncover fine-grained strengths and weaknesses in model self-assessment that other tests miss.

Metacognition · Confidence calibration · Large language models · Epistemic vigilance · Knowledge boundary · Calibration range · Reasoning-chain validation · Model evaluation · Benchmarking · Spearman correlation
Authors
Rafael C. T. Oliveira
Abstract
The Metacognitive Probe is an exploratory five-task, 15-slot diagnostic that decomposes an LLM's confidence behaviour into five behaviourally distinct dimensions: confidence calibration (T1-CC), epistemic vigilance (T2-EV), knowledge boundary (T3-KB), calibration range (T4-CR), and reasoning-chain validation (T5-RCV). It is evaluated on N=8 frontier models and N=69 humans. The instrument is motivated by Flavell (1979) and Nelson and Narens (1990) but operates on observable confidence-correctness alignment; it is not a validated cross-species metacognition scale, and the pre-specified human developmental hypothesis was falsified. Composite benchmarks (MMLU, BIG-Bench, HELM, GPQA) ask whether a model produces a correct response; they are silent on whether the model knows when its response is wrong. A model can score 80 on a composite calibration benchmark and still be wildly overconfident in narrow pockets that the aggregate cannot surface. The Metacognitive Probe surfaces those pockets. Our headline finding is a 47-point within-model dissociation in Gemini 2.5 Flash: panel-best within-task calibration (T1-CC = 88; Spearman rho = +0.551, 95% CI [+0.14, +0.80], p = 0.005) and panel-worst cross-task difficulty prediction (T4-CR = 41; sigma_conf = 1.4 across twelve factoids).
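
The two headline numbers can be read as simple statistics over per-item confidence reports. Below is a minimal Python sketch, not the authors' released code, of how a within-task calibration score (the Spearman rho between stated confidence and correctness, as in T1-CC) and a cross-item confidence spread (sigma_conf, as in T4-CR) could be computed; the 0-100 confidence scale, function names, and toy data are illustrative assumptions, and the toy values do not reproduce the reported results.

import numpy as np
from scipy.stats import spearmanr

def within_task_calibration(confidences, correct):
    """T1-CC-style score: Spearman rho between stated confidence and correctness (1 = right, 0 = wrong)."""
    rho, p_value = spearmanr(confidences, correct)
    return rho, p_value

def confidence_spread(confidences):
    """T4-CR-style sigma_conf: sample standard deviation of stated confidences across items."""
    return float(np.std(confidences, ddof=1))

# Toy T1 items: stated confidence varies and roughly tracks correctness.
t1_conf = [95, 40, 85, 30, 90, 55, 88, 35, 92, 60, 80, 45]
t1_hits = [1,  0,  1,  0,  1,  1,  1,  0,  1,  0,  1,  0]

# Toy T4 factoids: near-identical confidence on every item, i.e. the model
# barely differentiates easy from hard questions (small sigma_conf).
t4_conf = [92, 93, 91, 94, 92, 90, 93, 92, 91, 94, 93, 92]

rho, p = within_task_calibration(t1_conf, t1_hits)
print(f"Spearman rho = {rho:+.3f} (p = {p:.3f})")
print(f"sigma_conf   = {confidence_spread(t4_conf):.1f}")

A high rho with a small sigma_conf in this sketch mirrors the dissociation described above: good confidence-correctness alignment within a task can coexist with a near-flat confidence profile across items of very different difficulty.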