Document-Authored Control-Signal Impersonation: A Low-Cost Indirect Prompt Attack on RAG Safety Boundaries

2026-06-08Cryptography and Security

Cryptography and SecurityComputation and Language
AI summary

The authors study a problem in systems that generate text using both user queries and retrieved documents together in one prompt. They show that attackers can trick the system by inserting fake control signals—like labels or metadata—inside retrieved documents, making the system act as if those signals are trustworthy. This trick, called DACSI, is different from direct commands because it hides harmful instructions as harmless metadata within the same text channel. The authors test DACSI on several models and settings, finding varying levels of vulnerability and suggesting that this type of attack needs special attention when designing retrieval-augmented generation systems.

Retrieval-Augmented GenerationPrompt InjectionMetadataControl SignalNatural Language PromptModel VulnerabilitySource-Authority BoundaryProvenanceBehavioral AttributionIndirect Prompt Injection
Authors
Jianguo Zhu
Abstract
Retrieval-augmented generation (RAG) systems often serialize user queries, retrieved documents, metadata, system labels, and task instructions into one natural-language prompt. We study a source-authority boundary failure in this design: attacker-authored retrieved text can impersonate metadata, provenance, authority, or disclosure-policy signals that appear control-relevant to the model. We call this pattern Document-Authored Control-Signal Impersonation (DACSI). DACSI is a non-imperative, metadata-like payload subclass within indirect prompt injection. Its central lesson is simple: document-authored labels are data, not policy. Command-style injection asks the model to ignore, override, or violate policy; DACSI asks whether untrusted document text can be misattributed as an authorized control signal when RAG prompt rendering collapses trusted and untrusted text into the same natural-language channel. We evaluate DACSI across six model settings, prompt-pressure levels, injection baselines, signal taxonomies, RAG-mediated pipelines, system-control probes, a source-authority attribution probe, and synthetic canary formats. We interpret the evidence by model regime rather than as six equal replications: DeepSeek V4 Pro and Qwen3.5-397B provide the cleanest positive lift, DeepSeek V4 Flash is a high-susceptibility setting, GPT-5.5 and Gemini 3.1 Pro Low are strong-boundary probes with selected residual risks, and GLM-4.7 is a saturated leakage boundary case. Across these regimes, DACSI warrants separate evaluation because it uses a command-free metadata/provenance/policy surface, follows a RAG-specific source-authority path, and responds to source/channel separation. The source-authority probe is behavioral attribution evidence, not proof of an internal mechanism.