When Robots Rate Their Own Interactions: Engagement Validity and the Strangeness Failure

2026-06-22

Robotics
AI summary

The authors studied whether LLM-powered robots can evaluate their own interactions with humans, inverting the usual process in which only humans rate the experience. LLM ratings agreed moderately to strongly with human ratings on engagement (satisfaction and enjoyment) and were highly repeatable, but they systematically inverted the comfort/strangeness dimension, conflating engagement with comfort. This failure held across five models, synthetic controls, and live interactions with a Nao robot. The authors argue that LLMs alone lack access to internal affective states such as discomfort and need supplementary signals such as physiological data, gaze, or proxemics. The work establishes boundary conditions for using LLMs as interaction evaluators in human-robot interaction.

Human-Robot Interaction (HRI), Large Language Models (LLMs), HRI-CUES, Godspeed Questionnaire, RoSAS, Engagement, Comfort, Strangeness, Affective States, Physiological Signals
Authors
Victor Lockwood, Hasan Mahmud, Mohammad Javad Khojasteh, Prabu David, Jamison Heard
Abstract
Human-robot interaction (HRI) evaluation relies almost exclusively on human-completed questionnaires, leaving the robot's perspective unexamined. We propose an inverted evaluation, in which LLM-powered robots complete the same standardized instruments from their own perspective, and test whether these ratings agree with human ground truth. In Study 1, five LLMs completed HRI-CUES, Godspeed, and RoSAS questionnaires for 25 interactions ($N = 1{,}522$ evaluations) from the HRI-CUES dataset. LLMs achieved moderate-to-strong agreement on engagement dimensions (satisfaction $r$ up to $.65$ and enjoyment $r$ up to $.72$) with excellent test-retest reliability (ICC $\geq .82$), but systematically inverted the comfort/strangeness dimension ($r = -.44$ to $-.67$, all $p < .05$), conflating engagement with comfort. In Study 2, a Nao robot running Claude Sonnet 4.5 replicated these patterns in live interactions ($N = 4$), including real-time turn-by-turn assessment. The strangeness failure persisted across five models, synthetic controls, and embodied deployment for two participants. We argue that current LLM-based robots lack access to the internal affective states needed to assess constructs like strangeness, and that inverted evaluation requires supplementary modalities (e.g., physiological signals, gaze, proxemics) to move beyond behavioral proxies. These findings establish boundary conditions for using LLMs as interaction evaluators in HRI.
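
The agreement figures above are correlations between LLM ratings and human ratings of the same interactions. The following is a minimal sketch of that comparison, assuming Pearson's $r$ computed per dimension; the rating lists are hypothetical illustrations, not data from the paper, and the function name `pearson_r` is our own. It shows how an LLM whose strangeness ratings track engagement would produce a positive $r$ on enjoyment but a negative $r$ on strangeness.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / (norm_x * norm_y)


# Hypothetical 1-5 ratings for six interactions (NOT from the paper).
human_enjoyment = [4, 5, 2, 3, 5, 1]
llm_enjoyment = [4, 4, 2, 3, 5, 2]

# Assumption for illustration: humans found the less enjoyable
# interactions stranger, while the LLM's strangeness ratings simply
# follow how engaging the interaction looked.
human_strangeness = [2, 1, 4, 3, 1, 5]
llm_strangeness = [4, 4, 2, 3, 5, 2]

r_enjoy = pearson_r(human_enjoyment, llm_enjoyment)
r_strange = pearson_r(human_strangeness, llm_strangeness)

print(f"enjoyment r = {r_enjoy:.2f}")
print(f"strangeness r = {r_strange:.2f}")
```

A negative coefficient on one dimension alongside positive coefficients on the others is the signature of the inversion described in the abstract. Test-retest reliability (ICC) is a separate statistic and is not covered by this sketch.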