Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions

2026-06-02
Computation and Language

AI summary

The authors point out that current tests of tool-using language models don't match how real people behave, especially when users are unclear or uncooperative. To fix this, they created RUT-Bench, a new benchmark that simulates realistic user behaviors in both single-turn and multi-turn conversations. They tested 19 popular language models and found that none exceeded a 40% overall success rate, and nearly all performed worse on tricky, non-ideal user inputs. This shows room for improvement in how these models handle real-world situations.

large language models, tool-use, benchmark, user behavior, dialogue systems, evaluation, multi-turn conversations, non-ideal inputs, simulations, performance metrics
Authors
Xuan Yang, Hao Xu, Tingfeng Hui, Hongsheng Xin, Kaike Zhang, Chunxiao Liu, Ning Miao
Abstract
Despite great advances in the tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on idealized assumptions about simulated users and lack experience-oriented evaluation. As a result, they fail to account for the ambiguity, uncooperative behaviors, and shifting intentions characteristic of real-world users. To fill this gap, we propose RUT-Bench, a dedicated benchmark designed to assess LLMs under diverse Real-world User Tool calling scenarios. RUT-Bench supports high-fidelity simulations covering both ideal rational patterns and heterogeneous non-ideal behaviors across single-turn and multi-turn dialogues. We conduct comprehensive evaluations of 19 widely adopted open-source and proprietary LLMs using our benchmark. Experimental results reveal that no tested LLM achieves an overall success rate above 40%, and nearly all of them experience noticeable performance drops when facing more complicated non-ideal user inputs. Our code and data are available at https://github.com/TorresYangX/RUT-Bench.