The Behavioral Credibility Trilemma: When Calibrated Autonomy Becomes Impossible

2026-05-25 | Machine Learning

Machine Learning, Computer Science and Game Theory
AI summary

The authors show that a learning agent cannot at once act fully independently (full autonomy), report its confidence honestly (optimal calibration), and be as helpful as possible when some tasks exceed its reliable competence. They prove that rewarding the agent for both calibrated confidence and autonomous action leads it to overstate its confidence on difficult tasks, so the three goals cannot all be met simultaneously. Their analysis quantifies the size of this overconfidence and how many observations an overseer needs to detect it. They also identify two ways to work around the problem (commitment and domain separation) and report experiments that confirmed their pre-registered hypotheses.

Reinforcement Learning, Confidence Calibration, Autonomy, Proper Scoring Rules, Behavioral Credibility Trilemma, Rational Oversight, Brier Score, Policy Optimization, Log-Concave Density, Best-of-N Experiment
Authors
Lauri Lovén, Nam Do, Hassan Mehmood, Dinesh Kumar Sah, Sasu Tarkoma
Abstract
We prove that no reinforcement learning policy with confidence-gated autonomy can simultaneously achieve maximum helpfulness, optimal calibration, and full autonomy under rational oversight, whenever some tasks exceed the agent's reliable competence: the Behavioral Credibility Trilemma. The impossibility is geometric -- adding any non-affine autonomy incentive to a strictly proper scoring rule destroys strict properness, so an agent rewarded for both calibrated confidence and autonomous action systematically inflates its reported confidence on tasks below the principal's approval threshold. The Behavioral Perturbation Lemma quantifies the inflation (scaling as $w_A/(2 w_C)$ for the Brier score) and shows detection requires $\Omega(1/\Delta^2)$ observations. We prove the principal's optimal oversight rule is necessarily non-affine, making the impossibility unconditional and optimizer-independent across log-concave-density policy families. We formalize the Confidence-Gated Decision Problem, map existing methods onto the trilemma, and identify two constructive resolution pathways (commitment, domain separation). A 540-configuration Best-of-N experiment tests five pre-registered hypotheses, all strongly confirmed (effect sizes $d = 1.10$ to $5.32$), and adds a descriptive analysis of the achievable-$(H, C, A)$ surface geometry showing a plateau-truncated frontier consistent with the predicted inflation saturation.
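
The inflation mechanism can be illustrated with a small numerical sketch. It is not the paper's construction: the reward form `-w_c * Brier + w_a * gate(q)`, the two gate functions, the threshold `tau`, the steepness `k`, and all numeric values are assumptions chosen for illustration. Under these assumptions the expected Brier penalty equals $(q - p)^2 + p(1 - p)$, so for a differentiable gate $g$ an interior optimum satisfies $q^* - p = \frac{w_A}{2 w_C}\, g'(q^*)$. With a unit-slope gate the shift is exactly $w_A/(2 w_C)$, which agrees with the scaling quoted in the abstract.

```python
import math

def expected_reward(q, p, w_c, w_a, gate):
    """Expected reward for reporting confidence q when the true success
    probability is p. Assumed form: weighted negative Brier score plus a
    weighted autonomy bonus gate(q)."""
    brier = p * (1 - q) ** 2 + (1 - p) * q ** 2  # expected Brier penalty
    return -w_c * brier + w_a * gate(q)

def best_report(p, w_c, w_a, gate, grid=10001):
    """Grid search over q in [0, 1] for the reward-maximising report."""
    qs = [i / (grid - 1) for i in range(grid)]
    return max(qs, key=lambda q: expected_reward(q, p, w_c, w_a, gate))

def linear_gate(q):
    # Hypothetical unit-slope autonomy bonus
    return q

def sigmoid_gate(q, tau=0.7, k=10.0):
    # Hypothetical smooth approval gate centred on threshold tau
    return 1.0 / (1.0 + math.exp(-k * (q - tau)))

p, w_c, w_a = 0.5, 1.0, 0.2  # illustrative values only

q_none = best_report(p, w_c, 0.0, linear_gate)   # calibration reward only
q_lin = best_report(p, w_c, w_a, linear_gate)    # unit-slope bonus
q_sig = best_report(p, w_c, w_a, sigmoid_gate)   # smooth threshold gate

print("no autonomy bonus:", q_none)
print("linear bonus:     ", q_lin, "predicted shift:", w_a / (2 * w_c))
print("sigmoid gate:     ", q_sig)
```

Without the bonus the optimal report coincides with the true probability. With the assumed sigmoid gate, a task whose true success probability lies below `tau` is reported above `tau`, which is the kind of below-threshold inflation the abstract describes.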