ENVS: Environment-Native Verified Search for Long-Horizon GUI Agents

2026-06-22

Artificial Intelligence, Computer Vision and Pattern Recognition
AI summary

The authors address the challenge of teaching computer agents to control real desktop software, which requires long sequences of precise mouse and keyboard actions. They propose Environment-Native Verified Search (ENVS), a training-time method that searches over candidate actions inside live virtual desktops and verifies which trajectories succeed before training the agent, improving success rates while using less compute. They also introduce OSWorld-Noisy, a benchmark that adds recoverable interruptions to the original tasks to test whether agents can recover mid-task. ENVS outperforms a matched ARPO-style online RL baseline, is more robust to interruptions, and better preserves visual-reasoning abilities.

multimodal agents, GUI automation, reinforcement learning, search-and-filter, OSWorld benchmark, policy optimization, virtual machines, interruptions recovery, visual reasoning, long-horizon tasks
Authors
Yincheng Zhou, Athena Zhuoming Zhong, Shijie Zhang, Kevin Zhang, Teresa Xiaotao Shang, Shanghang Zhang
Abstract
As multimodal agents move from interface understanding to real software control, successful trajectory discovery in live desktop environments becomes a key challenge. GUI tasks require long-horizon sequences of precise mouse and keyboard actions, while feedback is sparse, delayed, and costly to obtain through VM rollouts. We propose Environment-Native Verified Search (ENVS), a training-time search-and-filter pipeline that uses the environment to construct verified supervision before policy optimization: it branches over behaviorally distinct GUI actions in live OSWorld VMs, verifies successful leaves, and trains from globally balanced step-level supervision. To evaluate robustness under realistic desktop interruptions, we also introduce OSWorld-Noisy, a dynamic benchmark for recoverable desktop interruptions that preserves the original tasks while testing whether agents can refocus, dismiss, wait, or recover under live perturbations. On the 300-task OSWorld pool, ENVS reaches 30.3 pass@8 on original evaluations and 29.0 on OSWorld-Noisy, outperforming matched ARPO-style online RL while reducing compute from 184-192 to 138-153 GPU-hours; even with only 30% of its search data, ENVS reaches 27.0 pass@8, exceeding ARPO from the base model. Training from noisy environments also better preserves visual-reasoning abilities on auxiliary benchmarks, including OSWorld-G Refusal (16.7 vs. 1.9) and BLINK Functional Correspondence (26.2 vs. 23.1).
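The search-and-filter idea in the abstract (branch over behaviorally distinct actions, keep only verified leaves, then balance step-level supervision) can be illustrated with a minimal toy sketch. This is not the authors' implementation: the function names `verified_search` and `balanced_step_weights`, the depth-first strategy, the `behavior_key` deduplication rule, and the equal-weight-per-task balancing are all assumptions made for illustration, and an integer counter stands in for a live OSWorld VM.

```python
from typing import Callable, Hashable, Iterable

State = Hashable
Action = Hashable
Step = tuple[State, Action]


def verified_search(
    root: State,
    propose: Callable[[State], Iterable[Action]],
    step: Callable[[State, Action], State],
    behavior_key: Callable[[State, Action], Hashable],
    verify: Callable[[State], bool],
    max_depth: int,
) -> list[list[Step]]:
    """Depth-first search that returns only trajectories whose leaf passes verify."""
    verified: list[list[Step]] = []

    def expand(state: State, prefix: list[Step], depth: int) -> None:
        if verify(state):
            verified.append(list(prefix))  # keep a copy of the successful path
            return
        if depth == max_depth:
            return
        seen = set()
        for action in propose(state):
            key = behavior_key(state, action)
            if key in seen:  # skip actions that behave like one already tried
                continue
            seen.add(key)
            prefix.append((state, action))
            expand(step(state, action), prefix, depth + 1)
            prefix.pop()

    expand(root, [], 0)
    return verified


def balanced_step_weights(
    trajs_by_task: dict[str, list[list[Step]]],
) -> list[tuple[str, Step, float]]:
    """Assumed balancing rule: each solved task gets equal total weight,
    split evenly over its verified steps. Unsolved tasks contribute nothing."""
    steps_by_task = {
        task: [s for traj in trajs for s in traj]
        for task, trajs in trajs_by_task.items()
    }
    solved = {task: steps for task, steps in steps_by_task.items() if steps}
    samples = []
    for task, steps in solved.items():
        weight = 1.0 / (len(solved) * len(steps))
        samples.extend((task, s, weight) for s in steps)
    return samples


# Toy environment (hypothetical): the state is a counter and the goal is 3.
# "inc" behaves exactly like "+1", so it should be pruned as a duplicate.
DELTA = {"+1": 1, "inc": 1, "+2": 2}


def toy_step(state: int, action: str) -> int:
    return state + DELTA[action]


trajs = verified_search(
    root=0,
    propose=lambda state: list(DELTA),
    step=toy_step,
    behavior_key=toy_step,  # two actions are "the same" if they reach the same state
    verify=lambda state: state == 3,
    max_depth=3,
)
samples = balanced_step_weights({"reach_3": trajs, "unsolved": []})
```

In this toy setting the search keeps three verified trajectories (`+1,+1,+1`, `+1,+2`, and `+2,+1`) and never branches on the redundant `inc` action. In a real desktop environment, the behavior key would need some notion of action equivalence, for example which UI element a click lands on; that choice is left open here.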