No Single Best Model for Diversity: Learning a Router for Sample Diversity

2026-04-02 • Computation and Language

Computation and Language

AI summaryⓘ

The authors study how to get a large set of good and different answers from language models when questions have many valid answers. They create a new way to measure how well models cover diverse answers, called diversity coverage. Testing 18 language models, they find no single model is best for all questions, but each question has one model that works best. To use this, they build a system that picks the best model for each question and show it does better than using just one model. Their work helps explore ways to combine multiple models for more complete answers.

diversity coveragelanguage modelsanswer generationmodel evaluationprompting strategiesensemble methodsopen-ended promptsNB-Wildchatout-of-domain generalization

Authors

Yuhan Liu, Fangyuan Xu, Vishakh Padmakumar, Daphne Ippolito, Eunsol Choi

Abstract

When posed with prompts that permit a large number of valid answers, comprehensively generating them is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce \textbf{diversity coverage}, a metric that measures the total quality scores assigned to each \textbf{unique} answer in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs, finding no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet, per each prompt, there exists a model that outperforms all other models significantly at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-Wildchat, our trained router outperforms the single best model baseline (26.3% vs $23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as different answer-generation prompting strategies. Our work lays foundation for studying generating comprehensive answers when we have access to a suite of models.

View PDFOpen arXiv