Towards Fast Domain Adaptation and Fine-Grained User Simulation for Evaluating Conversational Recommender Systems

2026-06-22 | Information Retrieval

AI summary

The authors discuss how systems that recommend things through conversations are hard to evaluate. They point out problems with current simulators, like being too stuck in one domain and not capturing how people really talk or change their preferences. To fix this, they create AdaptSim, which adapts to new topics automatically and models user behavior more realistically. They also introduce a new way to compare conversations turn by turn to better test these systems. Their tests show AdaptSim helps evaluate recommendation systems more accurately and reliably across different areas.

Conversational Recommender Systems, Large Language Models, User Simulator, Domain Adaptability, Prompt Tuning, Controlled Text Generation, Think-then-respond Strategy, Breadth-First Search, Evaluation Metrics, Dialogue Systems
Authors
Yuanzi Li, Quanyu Dai, Xueyang Feng, Zihang Tian, Junhao Wang, Xu Chen, Zhenhua Dong, Huifeng Guo
Abstract
Conversational Recommender Systems (CRSs) enhance user experience through multi-turn interactions, yet evaluating their performance remains challenging. While Large Language Model (LLM) based user simulators are effective, they suffer from three key limitations: (1) Lack of Domain Adaptability: Reliance on fixed prompts and predefined action spaces hinders transfer to novel domains; (2) Limited User Modeling: Inability to accurately replicate subtle linguistic styles and dynamic preferences; (3) Insufficient Evaluation Validity: Existing simulators fail to adequately assess fundamental capabilities and system robustness. To overcome these limitations, we propose AdaptSim, an Adaptive domain and automatic prompt tuning User Simulator. AdaptSim offers an efficient framework for evaluating CRSs by enabling realistic behavior modeling and diverse style generation. It leverages automatic prompt generation and an open action mechanism to reduce manual effort and improve cross-domain flexibility. For response generation, we employ controlled text generation with a "think-then-respond" strategy for fine-grained control over language style. For CRS evaluation, AdaptSim incorporates a novel Breadth-First Search (BFS)-based, turn-level pairwise comparison framework for comprehensive assessment. Extensive experiments across three domains and four LLMs demonstrate that AdaptSim generates realistic dialogues, enabling effective and reliable evaluation of CRS capabilities and robustness.