Detecting Malicious Agent Skills in the Wild using Attention

2026-06-22 • Cryptography and Security

Cryptography and SecurityArtificial Intelligence

AI summaryⓘ

The authors study the security risks from skills, which are sets of natural language instructions that agents use and get from marketplaces. Since skills are made of instructions themselves, bad skills can hide harmful commands that are hard to spot with existing defenses. They created Locate-and-Judge, a two-step method that first finds important parts of a skill and then examines those parts more carefully, making detection faster and more scalable. When tested at large scale, their approach caught many malicious skills missed by other tools, and they shared their labeled dataset for further research.

LLM agentsskillsnatural language instructionsprompt injectionattack surfaceLocate-and-Judgeattention mechanismmalicious skillsmarketplace securitydataset release

Authors

Bacem Etteib, Daniele Lunghi, Tégawendé F. Bissyandé

Abstract

LLM agents increasingly load skills, file-based packages of natural-language instructions written by third parties and distributed through marketplaces, that execute with the user's privileges. A single malicious skill can exfiltrate data, hijack the agent, or persist as a supply-chain foothold, which turns the skill marketplace into a new attack surface for agentic systems. Prompt-injection defenses do not carry over to this setting. They rely on a boundary between trusted instructions and untrusted data, whereas a skill is itself a body of instructions, so an injected command sits among many legitimate ones and inherits their authority. We present Locate-and-Judge, a two-stage detector designed for this regime. A lightweight locator scores the structural spans of a skill by the instruction-following attention each span draws and retains only the top-K. A judge then examines the retained spans in detail. Concentrating the costly judgment on a few high-attention spans lets the detector audit an entire marketplace instead of a sample. Compared to direct LLM-based scanning, this approach offers an order-of-magnitude cost reduction, dramatically increasing its scalability at a small cost to recall, and it dominates keyword and regex baselines at comparable expense. Deployed at marketplace scale and at negligible cost, Locate-and-Judge flags skills with high precision, the majority of which we manually confirmed as malicious, surfacing dozens of live malicious skills, including several disguised as benign functionality and many that SkillSpector and Cisco Skill Scanner fail to detect. We release the resulting labeled dataset.

View PDFOpen arXiv