K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

2026-06-01 • Computation and Language

AI summary

The authors introduce K-BrowseComp, a benchmark of 400 web-browsing problems grounded in Korean contexts, designed to test how well AI models handle compositional, agentic tasks. A 300-problem subset, K-BrowseComp-Verified, was manually constructed and validated by native Korean speakers. On this subset, frontier models including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1 scored 30.00-45.67%, while Korean models released through Korea's Proprietary AI Foundation Model program scored 0.00-10.33%. A separate 100-problem synthetic split, adversarially filtered and reported on its own as a stress test, proved harder still: the strongest model scored 26.00%. The authors publicly release their data and code.

Large Language Models, Agentic Tasks, Benchmark, Web Browsing Agents, Korean Language, Few-shot Learning, Adversarial Testing, Synthetic Data, AI Evaluation

Authors

Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko, Jeonghun Park, Haneul Yoo, Jaewon Cho, Junghun Park, Changyoon Lee, Kyochul Jang, Jaeyeon Kim, Eunsu Kim, Woojin Cho, Seungone Kim

Abstract

Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00-45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea's Proprietary AI Foundation Model program obtain only 0.00-10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.
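
To make the reported percentages more concrete, the sketch below converts them into approximate counts of correctly answered problems. It assumes that each score is plain accuracy (correct answers divided by the number of problems in the split), which the abstract does not state explicitly. The function names `implied_correct` and `accuracy_percent` are hypothetical and are not taken from the released code.

```python
def implied_correct(accuracy_pct: float, num_problems: int) -> int:
    """Nearest whole number of correct answers implied by a percentage.

    Assumption: the score is plain accuracy, i.e. correct / total.
    """
    return round(accuracy_pct / 100 * num_problems)


def accuracy_percent(num_correct: int, num_problems: int) -> float:
    """Accuracy as a percentage, rounded to two decimals as in the abstract."""
    return round(100 * num_correct / num_problems, 2)


# Split sizes and scores are the ones quoted in the abstract.
VERIFIED_SIZE = 300
SYNTHETIC_SIZE = 100

best_verified = implied_correct(45.67, VERIFIED_SIZE)    # strongest frontier score
worst_frontier = implied_correct(30.00, VERIFIED_SIZE)   # weakest frontier score
best_korean = implied_correct(10.33, VERIFIED_SIZE)      # strongest Korean LLM score
best_synthetic = implied_correct(26.00, SYNTHETIC_SIZE)  # strongest score, synthetic split

print(best_verified, worst_frontier, best_korean, best_synthetic)
```

Under that assumption, the 45.67% and 10.33% figures correspond to roughly 137 and 31 of the 300 verified problems, and converting those counts back with `accuracy_percent` reproduces the two-decimal values quoted in the abstract.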
