UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
2026-04-15 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition · Artificial Intelligence · Computation and Language
AI summary
The authors present UI-Zoomer, a method that selectively zooms in on uncertain regions of interface screenshots to better localize small or densely packed elements described by natural language queries. Instead of always zooming in the same way, the system decides when and how much to zoom by measuring how uncertain the model is about its current prediction. This approach improves accuracy without any extra training. Evaluated on several datasets, UI-Zoomer yields consistent improvements across different models.
Keywords
GUI grounding, interface elements, natural language queries, uncertainty quantification, zoom-in methods, spatial consensus, token-level confidence, crop sizing, prediction variance, law of total variance
Authors
Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Abstract
GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
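To make the two mechanisms in the abstract concrete, the following is a minimal sketch of (a) a confidence-aware gate that fuses spatial consensus among stochastic candidate predictions with token-level generation confidence, and (b) a crop radius derived from the law of total variance, Var(X) = Var(E[X|s]) + E[Var(X|s)]. All function names, thresholds, and input conventions here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def should_zoom(centers, token_logprobs, spread_thresh=0.05, conf_thresh=0.7):
    """Hypothetical confidence-aware gate.

    centers: (N, 2) array of predicted click points (normalized coordinates)
             from N stochastic decoding passes of the grounding model.
    token_logprobs: log-probabilities of the generated coordinate tokens.
    Zoom-in is triggered only when the candidates disagree spatially or the
    model's token-level confidence is low. Thresholds are made up for the sketch.
    """
    spread = centers.std(axis=0).mean()                  # inter-sample disagreement
    confidence = float(np.exp(np.mean(token_logprobs)))  # mean per-token probability
    return bool(spread > spread_thresh or confidence < conf_thresh)

def crop_radius(centers, box_extents, k=3.0):
    """Hypothetical per-instance crop sizing via the law of total variance.

    Total variance = variance of per-sample box centers (inter-sample
    positional spread) + mean within-sample variance implied by the
    predicted box extent (here modeled as a uniform distribution over the
    box, whose variance per side is extent^2 / 12).
    Returns a radius proportional to the total standard deviation.
    """
    inter = centers.var(axis=0).mean()            # Var of conditional means
    intra = np.mean(box_extents ** 2 / 12.0)      # E of conditional variances
    return k * np.sqrt(inter + intra)
```

In this sketch, confident and spatially consistent predictions skip the zoom pass entirely, while uncertain ones get a crop whose size grows with both sources of variance, matching the adaptive behavior the abstract describes.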