UXBench: Measuring the Actionability of LLM-Generated UX Critiques

2026-06-15, Software Engineering

Software Engineering, Artificial Intelligence
AI summary

The authors created UXBench, a test to see how well large language models (LLMs) can judge user interfaces by exploring and reporting usability issues. They built different web setups for the models to interact with and required them to gather evidence before giving feedback. The models’ reports were checked by seeing if a repair tool could fix the interfaces based on their critiques. The authors found that models vary a lot in how useful and reliable their UX judgments are, and no single model is best in all situations.

large language models, user experience (UX), usability testing, benchmark, interaction design, web interfaces, automated repair, evaluation metrics
Authors
Wenjie Wang, Yue Huang, Zipeng Ling, Han Bao, Hang Hua, Xiaonan Luo, Yu Jiang, Shiyi Du, Yuexing Hao, Xiaomin Li, Yuchen Ma, Dianzhuo Wang, Yanfang Ye, Xiangliang Zhang
Abstract
Large language models (LLMs) are increasingly deployed as UX judges that inspect interfaces, diagnose usability problems, and propose repairs. Yet no controlled benchmark measures whether the resulting critiques are reliable and actionable across heterogeneous product surfaces. We introduce UXBench, a benchmark for evaluating LLMs as interaction-grounded UX judges. UXBench comprises local-first runnable web fixtures spanning ten product-surface families, paired with coverage-gated browser exploration that forces models to collect interaction evidence before reporting. Each judge model produces a structured UX report over seven rubric dimensions; report quality is measured by whether a fixed downstream repair agent can improve the interface based on the critique. We evaluate eight frontier models under both an automated repair-lift protocol and a blind human validation study. Results show that UX judging is neither saturated nor one-dimensional: models differ meaningfully in report actionability, exhibit distinct rubric-level repair signatures, vary in fixture-level reliability, and trade leadership across surface categories.
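The abstract does not give a formula for repair lift, so the following is only a minimal sketch of one plausible reading: the lift for a fixture is the interface's quality score after the repair agent acts on the critique, minus its score before. The `RepairOutcome` type, the field names, the fixture names, and the 0-to-1 scores are all hypothetical and are not taken from UXBench.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RepairOutcome:
    """One judge report applied to one fixture (all fields are assumptions)."""
    fixture: str          # hypothetical fixture name
    score_before: float   # assumed interface quality score before repair
    score_after: float    # assumed score after the repair agent uses the critique


def repair_lift(outcome: RepairOutcome) -> float:
    """Assumed definition: improvement attributable to the critique."""
    return outcome.score_after - outcome.score_before


def mean_lift(outcomes: list[RepairOutcome]) -> float:
    """Average lift across fixtures for a single judge model."""
    return mean(repair_lift(o) for o in outcomes)


# Illustrative, made-up values; not results from the paper.
outcomes = [
    RepairOutcome("checkout", 0.50, 0.75),
    RepairOutcome("dashboard", 0.25, 0.25),
]
print(mean_lift(outcomes))
```

Holding the repair agent fixed, as the abstract describes, means that differences in lift between judge models can be attributed to the critiques rather than to the repair step.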