Oversight Has a Capacity: Calibrating Agent Guards to a Subjective, Fatiguing Human

2026-06-08 • Artificial Intelligence

Artificial Intelligence, Cryptography and Security, Machine Learning

AI summary

The authors explore how large language model (LLM) agents are supervised when they perform important actions like editing files or running commands. They explain that having a human review risky actions isn't straightforward because people often disagree on what counts as risky, and humans get tired if they review too many actions. Their study shows that more human oversight doesn't always mean better safety—too much review can actually reduce safety due to reviewer fatigue. They provide an open-source system to measure and manage this balance between automated checks and human review, helping decide when to involve humans without overloading them.

LLM agents, human-in-the-loop, risk assessment, selective classification, reviewer fatigue, escalation policy, cost-sensitive deferral, flooding attack, guard mechanism, resource allocation

Authors

Emre Turan

Abstract

As LLM agents begin to take real, irreversible actions (shell commands, file edits, deploys), the standard safety pattern is a human-in-the-loop approval gate: risky actions pause and wait for a person. We argue the gate is the easy part; the hard part is the judgment of which actions to stop, and the field evaluates that judgment under two false assumptions: that there is a ground-truth notion of "risky," and that the human reviewer is a perfect, infinitely available oracle. On a hand-labeled set of 125 adversarially weighted agent actions we show that (i) reviewers only moderately agree on what is risky (Fleiss' kappa = 0.52), so there is no single correct label; (ii) framing the guard as selective classification under asymmetric cost makes its operating limits measurable, and on hard inputs the guard cannot safely auto-decide; and (iii) when the reviewer is modeled as endogenous (fatiguing as escalation load grows), realized safety becomes an inverted-U in the escalation rate: more human oversight can make a system less safe, and the safety-optimal guard escalates at a rate below full escalation. A load-aware policy also uses that setting to resist a flooding attack that slips a malicious action past a fatigued reviewer. Agent oversight, framed this way, is not only a classification problem but a resource-allocation one: human attention is finite, and the guard's escalation policy spends it. We claim none of these mechanisms as novel: fatigue-aware learning-to-defer (FALCON), cost-sensitive deferral under workload constraints (DeCCaF), trajectory-level guarding, and reviewer-fatigue/flooding attacks are all prior art that we cite. Our contribution is an open-source agent-oversight system that operationalizes and measures them in the LLM-agent action-gating setting, turning "is my guard good?" from a guess into a curve. The inverted-U and the flooding attack are modeling results that motivate a human study.
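To make the inverted-U claim concrete, the following is a minimal toy sketch and not the paper's model. It assumes the guard's recall of harmful actions grows concavely with the escalation rate, and that the reviewer's catch rate decays exponentially with that same rate. The function names, functional forms, and parameter values (`sharpness`, `fresh`, `fatigue`) are all hypothetical and chosen only so the shape is visible.

```python
import math

# Toy model only: every functional form and constant below is an assumption
# made for illustration, not a quantity taken from the paper.

def guard_recall(e, sharpness=0.3):
    """Assumed share of harmful actions that land in the escalated slice
    when the guard escalates a fraction e of all actions (concave in e)."""
    return e ** sharpness

def reviewer_catch_rate(e, fresh=0.98, fatigue=1.5):
    """Assumed probability that the reviewer stops an escalated harmful
    action; it decays as the escalation load e grows (fatigue)."""
    return fresh * math.exp(-fatigue * e)

def realized_safety(e):
    """Share of harmful actions actually stopped: the action must be both
    escalated by the guard and caught by the (possibly tired) reviewer."""
    return guard_recall(e) * reviewer_catch_rate(e)

def safety_curve(steps=100):
    """Realized safety sampled on an even grid of escalation rates in [0, 1]."""
    return [(i / steps, realized_safety(i / steps)) for i in range(steps + 1)]

def best_escalation_rate(steps=100):
    """Grid point (rate, safety) with the highest realized safety."""
    return max(safety_curve(steps), key=lambda point: point[1])

if __name__ == "__main__":
    rate, safety = best_escalation_rate()
    print(f"safety-optimal escalation rate: {rate:.2f} (safety {safety:.3f})")
    print(f"safety at full escalation:      {realized_safety(1.0):.3f}")
```

Under these assumed forms the optimum sits strictly inside the interval: escalating everything maximizes guard recall but exhausts the reviewer, which is the trade-off the abstract describes. Different fatigue assumptions would move or remove the peak, which is why the abstract frames the result as motivation for a human study.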
