HLL: Can Agents Cross Humanity's Last Line of Verification?

2026-06-01, Artificial Intelligence

Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning, Multimedia
AI summary

The authors created a benchmark called Humanity's Last Line of Verification (HLL) to see if AI agents can pass CAPTCHAs, which are puzzles used to confirm that a user is human. They tested eight advanced AI systems in a controlled, closed-loop GUI setting with different types of CAPTCHAs and added stressors such as cluttered web pages and harder task variants. The results showed that current AI agents struggle to solve CAPTCHAs reliably, especially when a correct answer must also be backed by a valid sequence of actions. This work highlights key areas where AI still falls short of substituting for humans in workflows that are deliberately protected against automation.

Multimodal agents, CAPTCHA, Human verification, GUI environment, Localization, Action calibration, State tracking, Process consistency, Automation, Benchmark
Authors
Xinhao Song, Su Su, Sirui Song, Hongliang Wu, Wen Shen, Zhihua Wei, Gongshen Liu, Linfeng Zhang, Dongrui Liu
Abstract
Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly substitute for humans in workflows that services deliberately protect against automation? CAPTCHA verification makes this question concrete. It is not merely a visual puzzle, but a human-verification boundary placed before account creation, content access, form submission, and other protected actions. We introduce Humanity's Last Line of Verification (HLL), a controlled benchmark that uses interactive CAPTCHA verification to evaluate whether agents can cross this boundary through grounded, human-like interaction rather than recognition alone. HLL covers diverse CAPTCHA interactions and exposes agents to controlled realism stressors, including cluttered webpages, harder task variants, and trace-conditioned validation of the solving process. We evaluate eight frontier multimodal agents in a closed-loop GUI environment. The results show that current agents remain brittle at this human-substitution boundary: performance varies sharply across verification types, degrades under realistic interface conditions, and drops further when correct answers must be supported by valid action traces. By exposing gaps in localization, action calibration, state tracking, and process consistency, HLL provides a concrete testbed for measuring how close multimodal agents are to acting as human substitutes in protected real-world workflows. Our code is available at https://github.com/XinhaoS0101/HLL