Automating the Design of Embodied Agent Architectures

2026-06-29 • Robotics

Robotics, Artificial Intelligence, Machine Learning

AI summary

The authors explore automating the design of embodied agents (systems that perceive, remember, plan, and act in an environment) using Agent Architecture Search (AAS). They introduce AgentCanvas, a runtime that hosts these agents as editable node-and-wire programs, and KDLoop, a search procedure that improves agent designs through cycles of proposal, critique, experiment, and distillation. Testing three AAS variants on four executors covering vision-language navigation, embodied question answering, and language-conditioned manipulation, they found that automated design can improve success rates but is hampered by rollout noise, search that gets trapped in local edit basins, and only partial episode-level credit assignment. Their work highlights both the benefits and the current limits of automatically creating agent architectures for embodied tasks.

embodied agents, Agent Architecture Search, perception, planning, simulator rollouts, AgentCanvas, KDLoop, vision-language navigation, credit assignment, architecture search

Authors

Jian Zhou, Sihao Lin, Jin Li, Shuai Fu, Gengze Zhou, Qi Wu

Abstract

Embodied agents are typically built as hand-designed compositions of perception, memory, planning, and action modules. This modularity exposes a large architectural design space, but current systems still rely on researcher intuition to choose where information is stored, how observations are processed, and how model calls are connected. Agent Architecture Search (AAS) automates such design for text-domain agents, but has not been systematically evaluated on perceptual embodied agents through simulator rollouts. We study this transfer. We introduce AgentCanvas, a typed-graph runtime that hosts embodied executors as editable node-and-wire programs with simulator-aware execution and episode-level logs, and KDLoop, a coding-agent search procedure that cycles through proposal, critique, experiment, and distillation, with triggered reflection after stalls. We evaluate three AAS variants across four embodied executors spanning vision-language navigation, embodied question answering, and language-conditioned manipulation. The resulting 3x4 matrix shows that architecture-level search can produce deployable and directional success-rate gains on embodied tasks, while one apparent high-scoring candidate is rejected as leak-bearing. At the same time, the experiments expose constraints that are muted in text-domain AAS: optimization signals can be masked by rollout noise, search can become trapped in local edit basins, and episode-level credit assignment only partially emerges even when detailed logs are available. These results characterize both the promise and the current limits of automated architecture search for embodied agents.
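The search cycle described in the abstract (proposal, critique, experiment, distillation, with reflection triggered after stalls) can be illustrated with a minimal sketch. This is not the authors' KDLoop implementation: the names `search_loop`, `SearchState`, and the `toy_*` functions are hypothetical, the architecture is reduced to a tuple of integers, and the simulator rollout is replaced by a noisy coin flip. It only shows the general shape of such a loop under those assumptions.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

Arch = Tuple[int, ...]  # hypothetical stand-in for an agent architecture


@dataclass
class SearchState:
    best: Arch
    best_score: float
    notes: List[str] = field(default_factory=list)  # distilled lessons
    stalls: int = 0  # consecutive rounds without improvement


def mean_score(evaluate, arch, episodes, rng):
    # Average a noisy per-episode success signal over several rollouts.
    return sum(evaluate(arch, rng) for _ in range(episodes)) / episodes


def search_loop(initial, propose, critique, evaluate,
                rounds=20, episodes=8, stall_limit=3, seed=0):
    rng = random.Random(seed)
    state = SearchState(initial, mean_score(evaluate, initial, episodes, rng))
    for _ in range(rounds):
        # Reflection (assumed form): widen the edit after repeated stalls.
        wide = state.stalls >= stall_limit
        candidate = propose(state.best, rng, wide)       # proposal
        if not critique(candidate):                      # critique
            state.notes.append(f"rejected {candidate}")
            continue
        score = mean_score(evaluate, candidate, episodes, rng)  # experiment
        if score > state.best_score:
            # Distillation (assumed form): record what worked.
            state.notes.append(f"kept {candidate} at {score:.2f}")
            state.best, state.best_score, state.stalls = candidate, score, 0
        else:
            state.stalls += 1
    return state


def toy_propose(arch, rng, wide):
    # Edit one slot normally, or every slot when reflection is triggered.
    arch = list(arch)
    count = len(arch) if wide else 1
    for i in rng.sample(range(len(arch)), count):
        arch[i] = rng.randint(0, 3)
    return tuple(arch)


def toy_critique(arch):
    # Hypothetical validity check applied before spending rollouts.
    return sum(arch) > 0


def toy_evaluate(arch, rng):
    # Noisy success signal: more matches with a hidden target raise the odds.
    target = (1, 2, 3)
    matches = sum(a == t for a, t in zip(arch, target))
    return 1.0 if rng.random() < 0.1 + 0.8 * matches / len(target) else 0.0


if __name__ == "__main__":
    result = search_loop((0, 0, 1), toy_propose, toy_critique, toy_evaluate)
    print(result.best, result.best_score, len(result.notes))
```

Because each candidate is scored from only a few noisy episodes, this toy loop can accept a candidate that merely got lucky, which is a small-scale analogue of the rollout-noise problem the abstract reports.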
