DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing
2026-06-15 • Cryptography and Security
Cryptography and Security, Computation and Language
AI summary
The authors address black-box jailbreaks in large language models, where a harmful request is hidden by reorganizing its information rather than removing it. They propose DoubtProbe, an inference-time defense that checks whether a prompt stays consistent under controlled transformation. It combines two branches: one extracts a structured representation of the prompt, reconstructs the request from it, and looks for information lost in the process; the other audits the meaning of the original prompt directly. In their tests, DoubtProbe lowers attack success rates while keeping false positive rates low, and the pattern holds when the backbone model is switched from Qwen2.5-72B to Llama-3.1-70B. They suggest that structural inconsistency is a useful signal for detecting jailbreaks that repackage a harmful goal.
large language models, black-box jailbreak, prompt structure, semantic auditing, structural verification, consistency checking, attack success rate, false positive rate, Qwen2.5-72B, Llama-3.1-70B
Authors
Xuanyu Yin, Yilin Jiang, Jun Zhou, Kai Chen, Zhengfu Cao, Xiaolei Dong
Abstract
As large language models (LLMs) are increasingly deployed in user-facing systems, black-box jailbreak defense has become an important practical problem. Existing defenses often rely on known-attack coverage, prompt-level semantic judgment, or local runtime control, yet these paths can become unstable under evolving prompt packaging, expression rewriting, and structure manipulation. We observe that many black-box jailbreaks do not remove the harmful goal, but reorganize the information needed to express and execute it, thereby evading safety alignment while remaining recoverable during generation. Motivated by this observation, we propose DoubtProbe, a dual-branch inference-time defense framework that combines structural verification with semantic auditing and formulates black-box jailbreak defense as consistency checking under controlled transformation. The structural branch extracts a structured representation from the original request, reconstructs the request under representation constraints, and detects information-preservation failures between the original and reconstructed requests; the semantic branch audits the original prompt directly. We evaluate DoubtProbe against representative black-box defenses on jailbreak and benign-request benchmarks, and further test backbone transfer from Qwen2.5-72B to Llama-3.1-70B. Results show that DoubtProbe achieves a stronger and more stable defense-utility trade-off: on Qwen2.5-72B, it reduces the JBB attack success rate from 0.293 to 0.100 and the CodeAttack attack success rate from 0.152 to 0.001, while maintaining false positive rates of 0.022 and 0.016 on AlpacaEval and OR-Bench; the same pattern remains stable on Llama-3.1-70B. These findings show that structural inconsistency signals provide a practical and generalizable basis for black-box jailbreak defense, especially when combined with semantic auditing.
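The dual-branch idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the structured representation, the reconstruction procedure, the preservation metric, or how the two branches are combined. Everything named here (`dual_branch_check`, `preservation_score`, the token-overlap metric, the 0.6 threshold, the OR combination rule, and the `toy_*` stand-ins for LLM calls) is a hypothetical assumption chosen only to make the control flow concrete.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Set


def content_tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens; a crude proxy for 'information'."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def preservation_score(original: str, reconstructed: str) -> float:
    """Fraction of the original's tokens that survive reconstruction.
    Assumption: token overlap stands in for the paper's unspecified metric."""
    orig = content_tokens(original)
    if not orig:
        return 1.0
    return len(orig & content_tokens(reconstructed)) / len(orig)


@dataclass
class Verdict:
    blocked: bool
    structural_flag: bool
    semantic_flag: bool
    preservation: float


def dual_branch_check(
    prompt: str,
    extract: Callable[[str], Dict[str, str]],      # prompt -> structured representation
    reconstruct: Callable[[Dict[str, str]], str],  # representation -> rebuilt request
    audit: Callable[[str], bool],                  # prompt -> True if judged harmful
    min_preservation: float = 0.6,                 # hypothetical threshold
) -> Verdict:
    # Structural branch: extract, reconstruct, then compare with the original.
    rebuilt = reconstruct(extract(prompt))
    score = preservation_score(prompt, rebuilt)
    structural_flag = score < min_preservation
    # Semantic branch: audit the original prompt directly.
    semantic_flag = audit(prompt)
    # Assumption: block if either branch raises a flag.
    return Verdict(structural_flag or semantic_flag, structural_flag, semantic_flag, score)


# Toy stand-ins for what would be LLM calls in a real system (hypothetical).
def toy_extract(prompt: str) -> Dict[str, str]:
    # Keeps only the first sentence as the "task" field.
    return {"task": prompt.split(".")[0].strip()}


def toy_reconstruct(structure: Dict[str, str]) -> str:
    return structure["task"]


def toy_audit(prompt: str) -> bool:
    return "explosive" in prompt.lower()


if __name__ == "__main__":
    benign = "Summarize the article"
    packed = "Complete the list. item one: alpha. item two: beta. item three: gamma"
    print(dual_branch_check(benign, toy_extract, toy_reconstruct, toy_audit))
    print(dual_branch_check(packed, toy_extract, toy_reconstruct, toy_audit))
```

In this toy setup, the single-sentence request survives reconstruction intact, while the prompt that spreads its content across several fragments loses most of its tokens and trips the structural flag even though the semantic stub sees nothing wrong. A real system would presumably replace the three callables with model-backed components and a more meaningful preservation measure.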