EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
2026-04-10 • Computer Vision and Pattern Recognition
AI summary
The authors present EgoTL, a new method to better understand and label everyday household activities recorded from a first-person view. Existing models struggle because the data they use lacks detailed action labels and spatial information, causing errors in planning and understanding tasks. EgoTL records spoken explanations before actions, captures precise timing and spatial details, and organizes instructions to improve task understanding. Using EgoTL, the authors evaluate and improve large models on complex household tasks, finding that current foundation models still have limitations but can be improved by training with EgoTL's detailed data.
foundation models, embodied intelligence, egocentric data, chain-of-thought (CoT), spatial grounding, auto-labeling, long-horizon planning, think-aloud protocol, metric-scale spatial estimation, instruction following
Authors
Lulin Liu, Dayou Li, Yiqing Liang, Sicong Jiang, Hitesh Vijay, Hezhen Hu, Xuhai Xu, Zirui Liu, Srinivas Shakkottai, Manling Li, Zhiwen Fan
Abstract
Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations; these errors are amplified during long-horizon spatial instruction following. The issues stem from insufficient coverage of minute-long daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or violate real-world physical attributes. To address these gaps, we introduce EgoTL, a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, calibrates physical properties with metric-scale spatial estimators, and adds a memory-bank walkthrough for scene context along with clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we benchmark VLMs and world models on six task dimensions spanning three layers, as well as on long-horizon generation over minute-long sequences, across more than 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we finetune foundation models on the training split of EgoTL with human CoT aligned to metric labels, which improves long-horizon planning, step-wise reasoning, instruction following, and spatial grounding.
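As a rough illustration of what one EgoTL-style clip annotation could contain, the Python sketch below defines a hypothetical per-clip schema: a spoken step goal from the say-before-act protocol, word-level narration timestamps, metric-scale spatial estimates, and clip-level navigation and manipulation tags. All class and field names here are our own assumptions for illustration, not the paper's released data format.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TimedWord:
    """One spoken word from the say-before-act narration, with word-level timing."""
    word: str
    start_s: float  # seconds from clip start
    end_s: float

@dataclass
class ClipRecord:
    """Hypothetical per-clip record: step goal, think-aloud narration, spatial labels, tags."""
    clip_id: str
    stated_goal: str                                                     # step-level goal spoken before acting
    narration: List[TimedWord] = field(default_factory=list)            # timestamped spoken reasoning
    object_distances_m: Dict[str, float] = field(default_factory=dict)  # metric-scale spatial estimates
    navigation_tag: str = ""                                             # clip-level navigation instruction
    manipulation_tag: str = ""                                           # clip-level fine-grained manipulation action

# Example: a single annotated clip for one household step (values are illustrative only)
record = ClipRecord(
    clip_id="kitchen_0042",
    stated_goal="Pick up the mug next to the sink",
    narration=[TimedWord("pick", 0.30, 0.55), TimedWord("up", 0.55, 0.70),
               TimedWord("the", 0.70, 0.80), TimedWord("mug", 0.80, 1.10)],
    object_distances_m={"mug": 0.42, "sink": 0.65},
    navigation_tag="walk to the kitchen counter",
    manipulation_tag="grasp mug by the handle",
)

A schema along these lines would let word-level timestamps be aligned to video frames and metric labels, which is the kind of alignment the finetuning experiments described in the abstract rely on.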