REPAIR-Bench: A Benchmark for Robot Error Perception And Interaction Recovery

2026-06-29 • Robotics

AI summary

The authors created REPAIR-Bench, a dataset and set of tasks for studying how people notice and react to robot failures during interactions. The work goes beyond treating failures as yes/no events: it examines how users adapt across repeated failures and which types of failure occur. The authors tested models that detect failures, classify their type, and predict how users want the robot to recover, and found that more complex, context-aware models worked better. The benchmark aims to help researchers build robots that handle mistakes in a smarter and more user-friendly way.

human-robot interaction, failure detection, failure classification, recovery prediction, hierarchical recurrent modeling, facial action units, longitudinal user adaptation, speech transcripts, QLoRA, benchmark dataset

Authors

Giuliano Pioldi, Yashika Batra, Arman Ibrayeva, Yuanchen Bai, Purnjay Maruur, Promise Ekpo, Angelique Taylor

Abstract

Understanding how users perceive and respond to robot failures is essential for building robust and trustworthy robot systems. Prior work, however, often (i) treats failures as independent events, (ii) emphasizes binary failure detection, and (iii) relies on rule-based recovery modeling. We present REPAIR-Bench, a benchmark built on 214 interaction trials from 41 participants. It spans four induced failure types and provides synchronized facial action units, head pose, speech transcripts, and post-interaction affect and recovery reports. The benchmark defines three novel evaluation tasks that jointly capture the lifecycle of failure in human-robot interaction (HRI): (i) failure detection over inter-dependent interaction sessions, modeling longitudinal user adaptation across repeated failures; (ii) visual failure-type classification beyond binary success/failure formulations; and (iii) user-centered recovery prediction, inferring users' preferred recovery strategies from interaction context rather than relying on manually designed or rule-based strategies. In baseline experiments, hierarchical recurrent modeling improved failure detection over a single-session model (strict F1: 0.80 vs. 0.68) and achieved a failure localization mean signed error of -0.51 s and a median absolute error of 2.97 s. For recovery prediction, a QLoRA-tuned Mistral-7B reached Hit@5 = 0.76 and F1@5 = 0.32. REPAIR-Bench provides both the HRI and Medical HRI communities with a standardized framework for (1) evaluating robot failures and (2) building transparent, adaptive, and trustworthy recovery systems.
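
The abstract reports four metrics without defining them. The sketch below shows one common way to compute them. It is not the authors' evaluation code. The sign convention (predicted minus true onset, so a negative value means the prediction came early), the Hit@k and F1@k definitions, all function names, and the strategy labels in the example are assumptions made for illustration.

```python
from statistics import mean, median


def signed_errors(predicted_onsets, true_onsets):
    """Per-trial error in seconds; assumed convention: predicted minus true."""
    return [p - t for p, t in zip(predicted_onsets, true_onsets)]


def mean_signed_error(predicted_onsets, true_onsets):
    """Average bias: negative means predictions tend to come early."""
    return mean(signed_errors(predicted_onsets, true_onsets))


def median_absolute_error(predicted_onsets, true_onsets):
    """Typical size of the timing error, ignoring its direction."""
    return median(abs(e) for e in signed_errors(predicted_onsets, true_onsets))


def hit_at_k(ranked, relevant, k=5):
    """1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def f1_at_k(ranked, relevant, k=5):
    """Harmonic mean of precision and recall over the top-k ranked items."""
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical failure onsets (seconds) for three trials.
predicted = [10.0, 12.5, 7.0]
actual = [11.0, 12.0, 9.0]
mse = mean_signed_error(predicted, actual)
mae = median_absolute_error(predicted, actual)

# Hypothetical ranked recovery strategies and a single user-preferred one.
ranked = ["apologize", "retry", "explain", "ask_help", "slow_down", "stop"]
preferred = {"retry"}
hit = hit_at_k(ranked, preferred)
f1 = f1_at_k(ranked, preferred)
```

Under these assumed definitions, a user with one preferred strategy gets a precision of at most 1/5 from a five-item list, so F1@5 can stay low even when Hit@5 is high. Whether that explains the gap between the reported 0.76 and 0.32 depends on the paper's actual metric definitions.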
