Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

2026-07-02 · Robotics

Robotics, Computer Vision and Pattern Recognition, Operating Systems
AI summary

The authors present Embodied.cpp, a portable C++ runtime for running embodied AI models, namely vision-language-action (VLA) models and world-action models (WAMs), on robots and simulators. They observe that existing inference runtimes target request-response serving and do not meet the needs of real-time closed-loop control on heterogeneous devices. Their system organizes model execution into five layers and emphasizes low-latency, multi-rate, and modular execution. In their evaluation, two VLA deployments reached 100.0% and 91.0% task success rates, and a preliminary WAM benchmark reduced block memory from 312.2 MiB to 88.1 MiB.

Embodied AI, Vision-Language-Action Models, World-Action Models, Inference Runtime, Closed-Loop Control, Multi-Rate Execution, Latency-First Inference, Heterogeneous Hardware, Modular Execution, Deployment Efficiency
Authors
Ling Xu, Chuyu Han, Borui Li, Hao Wu, Shiqi Jiang, Ting Cao, Chuanyou Li, Sheng Zhong, Shuai Wang
Abstract
Embodied AI models now span vision-language-action (VLA) models and world-action models (WAMs), but practical deployment remains fragmented across model-specific Python stacks, backend assumptions, and robot-side glue code, especially on heterogeneous edge devices. Existing inference runtimes are designed mainly for request-response serving and therefore do not satisfy the runtime contract of embodied deployment: multi-rate execution inside closed-loop control, latency-first batch-1 inference on heterogeneous hardware, and extensible embodied interfaces beyond fixed token I/O. We present Embodied.cpp, a portable C++ inference runtime for embodied models. Based on an architectural analysis of representative VLA models and WAMs, Embodied.cpp captures a shared execution path and organizes it into five layers: input adapters, sequence builders, backbone execution, head plugins, and deployment adapters. The runtime provides modular multi-rate execution, latency-first fused inference, and extensible operator and I/O support, enabling deployment across heterogeneous devices, robots, and simulators through one backend abstraction. We evaluate Embodied.cpp on two VLA models, HY-VLA and pi0.5, and on a preliminary WAM benchmark using a LingBot-VA Transformer block. The VLA deployments achieve successful closed-loop execution with 100.0% and 91.0% task success rates, respectively. The WAM benchmark reduces block memory from 312.2 MiB to 88.1 MiB. These results show that Embodied.cpp improves deployment efficiency while preserving high accuracy across diverse embodied model architectures.
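The five-layer path and the idea of multi-rate execution can be illustrated with a minimal C++ sketch. It is not the Embodied.cpp API: every type and function name here (`InputAdapter`, `SequenceBuilder`, `Backbone`, `ActionHead`, `DeploymentAdapter`, `run_closed_loop`) is hypothetical, the arithmetic inside each layer is a placeholder, and the split into a slow backbone and a fast action head is one assumed reading of "multi-rate execution".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// All names below are hypothetical illustrations, not the Embodied.cpp API.
using Tensor = std::vector<float>;

struct Observation { Tensor image; Tensor proprio; };  // raw robot-side input
struct Action { Tensor joints; };                      // command sent to the robot

// Layer 1: input adapter. Turns raw observations into model features.
struct InputAdapter {
    Tensor encode(const Observation& obs) const {
        Tensor out = obs.image;
        out.insert(out.end(), obs.proprio.begin(), obs.proprio.end());
        return out;
    }
};

// Layer 2: sequence builder. Arranges features into the backbone's input
// sequence (identity here, as a placeholder).
struct SequenceBuilder {
    Tensor build(const Tensor& features) const { return features; }
};

// Layer 3: backbone execution. Placeholder computation: mean of the sequence.
struct Backbone {
    Tensor forward(const Tensor& seq) const {
        float sum = 0.0f;
        for (float v : seq) sum += v;
        float mean = seq.empty() ? 0.0f : sum / static_cast<float>(seq.size());
        return Tensor{mean};
    }
};

// Layer 4: head plugin. Maps cached backbone context plus fresh
// proprioception to an action.
struct ActionHead {
    Action decode(const Tensor& context, const Tensor& proprio) const {
        Action a;
        for (float p : proprio) a.joints.push_back(p + context[0]);
        return a;
    }
};

// Layer 5: deployment adapter. Stands in for a robot or simulator interface;
// it only counts the actions it receives.
struct DeploymentAdapter {
    std::size_t sent = 0;
    void send(const Action&) { ++sent; }
};

struct LoopStats { std::size_t backbone_calls; std::size_t actions_sent; };

// Assumed multi-rate loop: the backbone runs once every `backbone_period`
// control ticks, while the head emits an action on every tick using the
// most recent cached context.
LoopStats run_closed_loop(std::size_t ticks, std::size_t backbone_period) {
    assert(backbone_period > 0);
    InputAdapter input;
    SequenceBuilder builder;
    Backbone backbone;
    ActionHead head;
    DeploymentAdapter robot;

    Tensor context;  // filled on tick 0, because 0 % backbone_period == 0
    std::size_t backbone_calls = 0;
    for (std::size_t t = 0; t < ticks; ++t) {
        // Synthetic observation; a real adapter would read sensors here.
        Observation obs{Tensor{0.5f, 0.5f}, Tensor{0.1f * static_cast<float>(t)}};
        if (t % backbone_period == 0) {  // slow path
            context = backbone.forward(builder.build(input.encode(obs)));
            ++backbone_calls;
        }
        robot.send(head.decode(context, obs.proprio));  // fast path
    }
    return LoopStats{backbone_calls, robot.sent};
}
```

Calling `run_closed_loop(10, 5)` runs ten control ticks with the backbone refreshed every fifth tick, so the head reuses a cached context on the ticks in between. Keeping each layer behind its own small type is meant to suggest how a model-specific head or a robot-specific adapter could be swapped without touching the rest of the path.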