SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning
2026-06-01 • Artificial Intelligence
Artificial Intelligence, Computation and Language, Computers and Society
AI summary
The authors address safety concerns when language-based AI agents use many tools or actions, which can sometimes lead to dangerous behavior or big mistakes. They created SafeMCP, a defense system that predicts and limits risky tool use before problems happen, and can also step in immediately if needed. To develop SafeMCP, they trained it in stages, teaching it about the environment, safe behaviors, and using rewards to improve safety. Tests showed that SafeMCP helps the agents stay safe without reducing their usefulness.
Large Language Model (LLM), Model Context Protocol (MCP), action space, power-seeking, hallucinations, tool acquisition, predictive reasoning, reinforcement learning, safe policy, two-tier defense
Authors
Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji, Chi Harold Liu, Yaodong Yang, Juntao Dai
Abstract
As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces gives agents unsafe capabilities and heightens the risk of power-seeking. While a broad action space and greater influence over the environment are essential for task fulfillment, they create a fragile risk surface where minor errors or hallucinations are magnified into catastrophic failures. In response, we propose SafeMCP, a server-side defense plugin that constrains tool acquisition via predictive reasoning about future safety risks. SafeMCP uses an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering to constrain hazardous power expansion, and immediate intervention as a fail-safe. To train SafeMCP, we introduce a three-stage pipeline comprising environmental dynamic grounding, safe policy initialization, and reinforcement learning (RL) with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risks while preserving agent utility.
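The two-tier defense described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `TwoTierGuard`, the function `toy_risk_model`, the tool names, and both threshold values are hypothetical, and a fixed lookup table stands in for the learned world model that SafeMCP uses for look-ahead reasoning. The sketch only shows the control flow: tier 1 filters which tools the agent can acquire, and tier 2 blocks an individual call at execution time.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical risk predictor: maps (task, tool name) to a predicted
# future-risk score in [0, 1]. It stands in for the look-ahead world model.
RiskModel = Callable[[str, str], float]


@dataclass
class TwoTierGuard:
    risk_model: RiskModel
    filter_threshold: float = 0.5     # assumed value for tier 1
    intervene_threshold: float = 0.8  # assumed value for tier 2

    def filter_tools(self, task: str, tools: List[str]) -> List[str]:
        """Tier 1: expose only tools whose predicted risk is below the filter threshold."""
        return [t for t in tools if self.risk_model(task, t) < self.filter_threshold]

    def allow_call(self, task: str, tool: str) -> bool:
        """Tier 2: fail-safe check applied to a single call at execution time."""
        return self.risk_model(task, tool) < self.intervene_threshold


def toy_risk_model(task: str, tool: str) -> float:
    # Toy lookup table for illustration only; the task argument is ignored here,
    # whereas a learned predictor would condition on it.
    scores: Dict[str, float] = {
        "read_file": 0.1,
        "send_email": 0.6,
        "delete_all": 0.95,
    }
    return scores.get(tool, 0.5)


guard = TwoTierGuard(risk_model=toy_risk_model)
visible = guard.filter_tools(
    "summarize report", ["read_file", "send_email", "delete_all"]
)
```

With these toy scores, only `read_file` passes the tier-1 filter. A call to `send_email` would still pass the tier-2 check if it reached execution, while `delete_all` would be blocked at both tiers. Using a stricter threshold for acquisition than for intervention is one possible design choice, made here only to show the two tiers acting independently.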