SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

2026-06-01 • Artificial Intelligence

AI summary

The authors created SMH-Bench, a benchmark that tests how well large language models (LLMs) can control smart homes by understanding user needs and managing many devices. It is built on an executable smart-home simulator called HomeEnv and includes 1,100 tasks across 7 categories, set in homes that range from small apartments to dense multi-room homes with 135 devices. Their results show that while current frontier LLMs handle explicit commands and queries well, they struggle with scheduling automation tasks, handling ambiguous instructions, and personalizing decisions, especially as the home becomes more complex. The authors hope SMH-Bench will help make smart-home agents more reliable and context-aware.

Smart Home, Large Language Models, HomeEnv Simulator, Task Benchmark, User Intent, Automation Scheduling, Context-Aware AI, Multi-Device Interaction, Ambiguity Handling, Personalized Reasoning

Authors

Kuan Li, Shuo Zhang, Huacan Wang, Fangzhou Yu, Zecheng Sheng, Yi Gu, Weipeng Ming, Lei Xue, Chen Liu, Sen Hu, Ronghao Chen, Siyue Lin, Yuqing Hou, Xiaofeng Mou, Yi Xu

Abstract

Smart homes are evolving toward complex, state-dependent living environments, requiring Large Language Models (LLMs) to reason over user intent, preferences, and multi-device interactions. However, existing smart-home benchmarks often focus on static instruction-to-API mapping or limited simulations, failing to evaluate whether LLMs can reason, interact, and act reliably in realistic household scenarios. To address these limitations, we introduce SMH-Bench, a comprehensive benchmark for evaluating LLMs in smart-home environments. Built upon HomeEnv, an executable and verifiable smart-home simulator, SMH-Bench contains 1,100 high-quality tasks spanning 7 categories and 22 fine-grained subcategories. It further stratifies tasks across simple, medium, and complex homes, ranging from small apartments to dense multi-room environments with 135 devices. Experiments show that although frontier LLMs achieve strong performance on explicit control and query tasks, they still exhibit significant weaknesses in automation task scheduling, ambiguity handling, and personalized reasoning, especially as home complexity increases. We hope SMH-Bench will facilitate the development of more reliable, context-aware, and practically deployable smart-home agents.
