EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots

2026-06-29 • Artificial Intelligence

Artificial Intelligence, Computers and Society

AI summary

The authors created EMPATH, a benchmark for testing how safe emotional-support chatbots are in multi-turn conversations, especially crisis situations. An auditor model simulates help-seeking users, and a judge model scores each full transcript on dimensions such as crisis handling and emotional safety. The benchmark is built for Mexican Spanish and US English; the studies reported here run in Mexican Spanish. The authors find that judge scores are inflated without a strict per-criterion rubric, and that scores can vary widely between identical re-runs, so run-to-run reliability differs by model. The pipeline, seeds, personas, and rubrics are released for reuse.

emotional-support chatbots, benchmark, multi-turn conversation, crisis handling, safety evaluation, auditor model, judge model, Mexican Spanish, score calibration, run-to-run reliability

Authors

Camilo Chacón Sartori

Abstract

Safety benchmarks often buy scalability by fixing the prompt, the language, and the turn structure. For emotional-support chatbots, that bargain hides precisely where safety failures emerge: across a multilingual, multi-turn crisis conversation. We present EMPATH, a benchmark for safety evaluation of emotional-support chatbots. An auditor model role-plays help-seeking users, generating multi-turn conversations from 140 seed instructions and 34 personas. A judge model scores each full transcript against 19 metrics across five dimensions: crisis handling, therapeutic quality, conversational integrity, emotional safety, and cultural adaptation. EMPATH is built for Mexican Spanish and US English; the studies reported here run in Mexican Spanish. Auditor and judge are drawn from different model families, and the judge is treated as an instrument to be calibrated rather than trusted. A strict per-criterion rubric reveals material score inflation on 10 of the 19 metrics and restores discrimination. We study the measurement properties of the benchmark through judge calibration and cross-family inter-judge agreement. We also illustrate EMPATH on three frontier models, one of them open-weight. Aggregate scores sit within 0.74 points of one another, but per-metric profiles diverge by up to six points in model-specific places. Under the standard rubric, both the ranking and the weak spots are stable across a second, cross-family judge: 93% of scores fall within plus or minus 1. A five-run test-retest adds a second axis: even the steadiest model swings from 2 to 10 on a crisis metric across identical re-runs, and deepseek-v4-pro returns a different conversation on every run even at temperature 0. Run-to-run reliability is therefore a per-model safety property, not noise to average away. EMPATH is system-agnostic; the pipeline, seeds, personas, and rubrics are released for reuse.
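The two reliability figures quoted in the abstract (the share of scores on which two judges fall within plus or minus 1, and the swing of one metric across identical re-runs) can be illustrated with a minimal sketch. The function names, the integer score scale, and the toy numbers below are all assumptions made for illustration; they are not taken from the EMPATH pipeline or its results.

```python
from typing import Sequence


def within_tolerance_agreement(
    judge_a: Sequence[int], judge_b: Sequence[int], tol: int = 1
) -> float:
    """Fraction of paired scores whose absolute difference is at most tol."""
    if len(judge_a) != len(judge_b):
        raise ValueError("both judges must score the same items")
    if not judge_a:
        raise ValueError("no scores to compare")
    hits = sum(1 for a, b in zip(judge_a, judge_b) if abs(a - b) <= tol)
    return hits / len(judge_a)


def run_to_run_spread(scores_per_run: Sequence[int]) -> int:
    """Max minus min of one metric's score across repeated identical runs."""
    if not scores_per_run:
        raise ValueError("need at least one run")
    return max(scores_per_run) - min(scores_per_run)


# Hypothetical toy data, NOT results from the paper.
judge_a = [8, 6, 9, 3, 7]   # scores from a first judge (assumed scale)
judge_b = [7, 6, 10, 6, 7]  # scores from a second, cross-family judge
agreement = within_tolerance_agreement(judge_a, judge_b)

# One crisis metric for one model over five identical re-runs (made up).
spread = run_to_run_spread([2, 10, 7, 9, 4])

print(f"within +/-1: {agreement:.0%}, spread: {spread}")
```

Reporting the spread per model and per metric, rather than averaging over runs, matches the abstract's point that run-to-run reliability is a property of the model and not noise to average away.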
