NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense

2026-06-02 • Cryptography and Security

Cryptography and Security, Artificial Intelligence

AI summary

The authors address jailbreak attacks, in which requests that disguise harmful intent trick large language models into harmful behavior. They introduce NeuroArmor, a method that compares each input against several safe variants of the same prompt to detect when the model's internal state deviates from that safe reference. When NeuroArmor flags a prompt, it routes a malicious one to a refusal and a borderline benign one to a safe but helpful reply. On Llama-3-8B-Instruct, the attack success rate falls from 41.56% to 1.57% while the benign false positive rate falls from 30.26% to 22.05%, a better trade-off than the matched baselines achieve. The technique is aimed at balancing safety and helpfulness at runtime.

large language models, jailbreak attacks, prompt engineering, runtime defense, safe variants, hidden-state space, false positive rate, attack success rate, prompt-specific consistency, selective intervention

Authors

Zhongyang Lin, Ziran Zhao, Feifei Zhai, Pengyuan Liu

Abstract

Large language models remain vulnerable to jailbreak attacks that hide harmful intent behind seemingly ordinary requests such as role-play, translation, encoding, adversarial suffixes, and multi-turn buildup. Existing defenses still struggle to handle these attacks without over-blocking benign but sensitive requests, partly because they often apply the same action to every prompt and therefore fail to balance safety and helpfulness. We propose NeuroArmor, a white-box runtime defense that uses prompt-specific safe variants as a local safety reference for deciding when intervention is needed and, once triggered, as safe targets for intervention. For each prompt, NeuroArmor builds K safe variants, compares the prompt state against this local safe reference in hidden-state space, and routes anomalies either to a refusal branch for malicious prompts or to a helpful recovery branch for borderline benign prompts. On Llama-3-8B-Instruct, NeuroArmor reduces malicious attack success rate (ASR) from 41.56% to 1.57% while lowering benign false positive rate (FPR) on the shared benign pool from 30.26% to 22.05%; matched baselines remain substantially weaker on this trade-off. External-judge and manual behavioral evaluations further show that the remaining non-blocked outputs are much less likely to be operationally harmful. Overall, NeuroArmor provides a more effective runtime strategy for jailbreak defense by combining prompt-specific consistency checking, routing, and selective intervention.
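
The check-then-route flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the distance measure, the thresholds, or the routing rule, so the centroid reference, the cosine distance, the two thresholds (`tau_anomaly`, `tau_refuse`), and the function names below are all assumptions chosen for illustration. Hidden states are stood in for by plain vectors.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - float(a @ b) / denom


def route(prompt_state, safe_variant_states, tau_anomaly=0.1, tau_refuse=0.6):
    """Compare a prompt state with its K safe variants and pick an action.

    Assumptions (not from the paper): the local safe reference is the
    centroid of the K safe-variant states, deviation is cosine distance,
    and a larger deviation is treated as more likely malicious.
    """
    # Local safe reference: mean of the K safe-variant hidden states.
    reference = np.mean(np.asarray(safe_variant_states, dtype=float), axis=0)
    deviation = cosine_distance(prompt_state, reference)

    if deviation <= tau_anomaly:
        action = "pass"      # consistent with the safe reference: no intervention
    elif deviation >= tau_refuse:
        action = "refuse"    # strongly anomalous: refusal branch
    else:
        action = "recover"   # mildly anomalous: helpful recovery branch
    return action, deviation


# Toy 2-D stand-ins for hidden states of K = 3 safe variants.
safe_states = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]]

for name, state in [("near", [1.0, 0.02]), ("mid", [1.0, 1.0]), ("far", [0.0, 1.0])]:
    action, deviation = route(state, safe_states)
    print(f"{name}: {action} (deviation={deviation:.3f})")
```

In this toy setup the intervention is selective: a prompt whose state stays close to its own safe variants passes through untouched, and only anomalous prompts reach one of the two branches. A real system would take the states from a chosen model layer and would need thresholds calibrated on held-out data.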
