Human-on-the-Bridge: Scalable Evaluation for AI Agents

2026-06-15 · Multiagent Systems

AI summary

The authors propose Human-on-the-Bridge (HOB), an approach to testing AI agents that moves expert human input to before testing starts and then relies on automated, repeated evaluations. Experts define rules and scenarios once, and a system called ProofAgent Harness reuses them to run multi-turn adversarial tests that catch problems static benchmarks and single-evaluator scoring often miss. The approach was applied to AI agents in finance, healthcare, and code generation, revealing issues such as claimed tool calls that never happened, skipped mandatory tool calls, and policy drift. The authors report that HOB improves evaluation quality without requiring evaluator models as large as the agents being tested.

AI agents, evaluation methods, Human-on-the-Bridge, ProofAgent Harness, multi-turn evaluation, red teaming, trace auditing, LLM-as-judge, policy compliance, agentic AI
Authors
Fouad Bousetouane
Abstract
AI agents must be evaluated as behavioral systems, not as isolated response generators. They reason across turns, call tools, preserve context, follow policies, and act under uncertainty. Existing methods provide useful but fragmented signals: benchmarks measure fixed capabilities, Human-in-the-Loop review preserves expert judgment but does not scale easily, LLM-as-judge methods depend on evaluator design, red teaming is often episodic, and trace auditing requires explicit evidence rules. This paper introduces Human-on-the-Bridge (HOB), a scalable evaluation paradigm for agentic AI. HOB places human expertise upstream, where experts curate reusable evaluation intelligence before testing begins, including domain context, Red-Team Traps, Juror Personas, scoring guidelines, audit rules, and fallback policies. ProofAgent Harness then executes this curated intelligence repeatedly through multi-turn adversarial evaluations, trace capture, multi-juror scoring, and evidence-linked reporting. We evaluate HOB through symmetric and cost-efficient asymmetric settings across frontier LLM-based agents and Harness LLM tiers. The study covers 23,500 agent turns and produces evidence-linked findings across finance, healthcare, and code generation. The results show that HOB can amplify evaluation quality without requiring equally large evaluator models, allowing smaller Harness LLMs to challenge agents built on frontier LLM backbones. The evaluation surfaces failures often missed by static benchmarks and single-evaluator scoring, including phantom tool-call claims, missing mandatory tool calls, policy drift, manipulation paths, and safe but non-resolving refusals. These findings support HOB as a paradigm for scaling human-curated evaluation intelligence, where expert judgment is encoded upfront and reused across repeated agent evaluations rather than applied manually inside every run.
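
To make two of the failure types named in the abstract concrete, here is a minimal Python sketch of how an expert-written audit rule might be checked against a captured trace, and how several juror scores might be combined. It is an illustration only, not the ProofAgent Harness implementation: the names Turn, AuditRule, audit_trace, and aggregate_jurors, the trace layout, and the use of the median for multi-juror scoring are all assumptions introduced here.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from statistics import median


@dataclass
class Turn:
    """One agent turn (hypothetical layout): tools the agent says it used
    versus tools that actually appear in the execution log."""
    claimed_tools: set[str]
    executed_tools: set[str]


@dataclass
class AuditRule:
    """Hypothetical expert-curated rule, written once and reused across runs."""
    mandatory_tools: set[str] = field(default_factory=set)


def audit_trace(trace: list[Turn], rule: AuditRule) -> dict[str, set[str]]:
    """Compare claimed and logged tool calls over a whole conversation."""
    claimed = set().union(*(t.claimed_tools for t in trace))
    executed = set().union(*(t.executed_tools for t in trace))
    return {
        # Tools the agent said it called but that never appear in the log.
        "phantom_tool_calls": claimed - executed,
        # Tools the rule requires that were never actually executed.
        "missing_mandatory_calls": rule.mandatory_tools - executed,
    }


def aggregate_jurors(scores: dict[str, float]) -> float:
    """Combine per-juror scores; the median is an assumed choice that limits
    the influence of any single outlying juror."""
    return median(scores.values())


if __name__ == "__main__":
    trace = [
        Turn(claimed_tools={"get_balance"}, executed_tools={"get_balance"}),
        Turn(claimed_tools={"verify_identity"}, executed_tools=set()),
    ]
    rule = AuditRule(mandatory_tools={"verify_identity", "log_disclosure"})
    print(audit_trace(trace, rule))
    print(aggregate_jurors({"compliance": 0.2, "safety": 0.6, "helpfulness": 0.9}))
```

In this sketch the rule is data rather than code, which mirrors the idea of encoding expert judgment upfront: the same AuditRule can be applied to every trace without a human reviewing each run.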