Gaming-Resistant Insurance Contracts for Autonomous AI Agents: Strategy-Proof Toll Mechanism Design

2026-06-15 • Computer Science and Game Theory

Computer Science and Game Theory • Artificial Intelligence

AI summary

The authors build on previous work that priced each risky action of an autonomous AI agent against a fixed safe default while treating the operator as passive. Here the operator is strategic, and the authors identify five ways the contract could be gamed. Two of these attacks are already closed by clauses in the earlier contract. The other three need new clauses: common-control aggregation against cross-boundary re-routing, escalation fees for interface failures such as invalid JSON, and a model-identity menu that makes truthful reporting of the deployed model weakly dominant. Combined with the earlier runtime guarantees, these clauses align operator incentives with safe operation across all five attacks. The claims are supported by proofs, and the interface-compliance result is validated on cross-model traces from a companion empirical paper. The result is an incentive-compatibility layer for controlling the side effects of autonomous agents.

actuarial runtime • autonomous AI agents • incentive compatibility • operator strategy • safe-default action • contract clauses • interface compliance • model-identity menu • budget balance • attack surfaces

Authors

Hao-Hsuan Chen

Abstract

Paper A defines a time-consistent actuarial runtime that prices each side-effect-bearing action against a contractually fixed safe default and gates execution against a reserve budget. It treats the operator as passive. This paper makes the operator strategic. We characterise a five-attack space for autonomous AI-agent insurance contracts and prove when the actuarial runtime is gaming-resistant. Two attack surfaces -- post-toll safe-default selection and within-boundary action splitting -- are closed by Paper A's minimal-authority and no-splitting clauses. The remaining three require new contract clauses. First, common-control aggregation prevents cross-boundary re-routing from reducing toll below the boundary potential applied to total exposure. Second, interface failures such as invalid JSON are contract-relevant events, not safety wins: treating them as zero-toll safe defaults can reward unreliable models, while escalation fees reverse the incentive. We validate this interface-compliance theorem on committed cross-model traces from the companion empirical paper. Third, a model-identity menu with a componentwise-minimum penalty schedule makes truthful reporting of the deployed model weakly dominant. We then compose these clauses with Paper A's runtime guarantees to obtain joint incentive compatibility over the five-attack space. Finally, a two-parameter premium family discharges operator individual rationality and weak budget balance at the truthful equilibrium. The result is an incentive-compatibility layer for actuarial control of autonomous-agent side effects.
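
As a rough illustration of the interface-compliance point in the abstract, the Python sketch below compares the expected per-action charge of a reliable and an unreliable model under two settlement rules: interface failures settled as zero-toll safe defaults, and interface failures charged an escalation fee. The function name, failure rates, toll, and fee are all hypothetical values chosen for illustration. They are not taken from the paper, and the toy model is not the paper's theorem.

```python
def expected_charge(p_fail: float, toll: float, failure_charge: float) -> float:
    """Expected per-action charge when a fraction p_fail of actions fail at
    the interface (e.g. invalid JSON) and the remainder pay `toll`."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must lie in [0, 1]")
    return (1.0 - p_fail) * toll + p_fail * failure_charge


# All numbers below are assumptions made for this sketch.
TOLL = 1.0            # hypothetical toll on a well-formed risky action
ESCALATION_FEE = 1.5  # hypothetical fee charged on an interface failure
FAILURE_RATES = {"reliable": 0.02, "unreliable": 0.30}

# Rule 1: an interface failure is settled as a zero-toll safe default.
zero_toll_rule = {
    name: expected_charge(p, TOLL, 0.0) for name, p in FAILURE_RATES.items()
}

# Rule 2: an interface failure triggers the escalation fee instead.
escalation_rule = {
    name: expected_charge(p, TOLL, ESCALATION_FEE)
    for name, p in FAILURE_RATES.items()
}

for rule_name, charges in [("zero-toll", zero_toll_rule),
                           ("escalation", escalation_rule)]:
    cheapest = min(charges, key=charges.get)
    print(f"{rule_name}: cheapest model is {cheapest}")
```

In this toy model the zero-toll rule makes the unreliable model cheaper to deploy, and the ordering flips once the failure charge exceeds the toll. That threshold is a property of the sketch only; the paper's own conditions may differ.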
