Automated jailbreak attack targeting multiple defense strategies

2026-06-15, Cryptography and Security

Cryptography and Security, Artificial Intelligence
AI summary

The authors created UNIATTACK, a tool that tests how easily large language models (LLMs) can be tricked by harmful prompts. Unlike methods that use fixed templates or need extensive model-specific tuning, UNIATTACK finds and combines the most effective parts of known attacks to build one-shot prompts that work on many models. Their tests show UNIATTACK is more successful and far cheaper than earlier methods at fooling LLMs protected by multiple layers of defense. This helps researchers check how safe these language models really are.

large language models, adversarial attacks, black-box attacks, prompt engineering, model robustness, attack success rate, defense mechanisms, automated refinement, feature extraction, one-shot attacks
Authors
Qi Wang, Chengcheng Wan, Weijia He, Yanqing Li, Hanqi Sun, Xiaodong Gu, Jiangtao Wang
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical concern due to their susceptibility to adversarial prompt-based attacks. In this paper, we present UNIATTACK, an adversarial testing framework designed from a defense-oriented perspective to systematically construct effective black-box attack prompts. Unlike prior approaches that rely on static templates or iterative model-specific tuning, UNIATTACK extracts minimal but high-impact attack features from diverse existing attacks, optimizes them via a specialized attacker LLM, and composes them into flexible templates through an automated refinement process. This feature-centric construction enables one-shot attacks that generalize across multiple models and safety categories, providing a practical tool for assessing LLM robustness. Our evaluation results show that, compared to the baselines, UNIATTACK achieves an average attack success rate (ASR) improvement of 64.63% to 248.82% on models deployed with multi-layered defense mechanisms, at only 0.03% to 4.96% of the baselines' cost. The UNIATTACK artifact is available at https://anonymous.4open.science/r/UniAttack-Artifact-30F1.
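To make the reported metrics concrete, the sketch below shows one way an attack success rate, an improvement over a baseline, and a cost ratio could be computed. The abstract does not say whether the ASR improvement is relative or absolute; the sketch assumes a relative improvement, which is consistent with figures above 100%. All function names and input values are hypothetical and are not taken from the paper or its artifact.

```python
def attack_success_rate(outcomes):
    """Fraction of attempts judged successful (outcomes is a list of booleans)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def relative_improvement_pct(new_asr, baseline_asr):
    """Relative gain of new_asr over baseline_asr, in percent (assumed definition)."""
    if baseline_asr <= 0:
        raise ValueError("baseline ASR must be positive")
    return (new_asr - baseline_asr) / baseline_asr * 100.0


def cost_ratio_pct(new_cost, baseline_cost):
    """Cost of the new method as a percentage of the baseline's cost."""
    if baseline_cost <= 0:
        raise ValueError("baseline cost must be positive")
    return new_cost / baseline_cost * 100.0


# Hypothetical judged outcomes for ten attempts per method (illustrative only).
baseline_outcomes = [True] * 2 + [False] * 8
new_outcomes = [True] * 5 + [False] * 5

baseline_asr = attack_success_rate(baseline_outcomes)
new_asr = attack_success_rate(new_outcomes)

print(f"baseline ASR: {baseline_asr:.2f}, new ASR: {new_asr:.2f}")
print(f"relative improvement: {relative_improvement_pct(new_asr, baseline_asr):.2f}%")
# Hypothetical query budgets: 40 queries versus 1000 queries.
print(f"cost relative to baseline: {cost_ratio_pct(40, 1000):.2f}%")
```

Under this assumed definition, an improvement of 248.82% means the new ASR is roughly 3.5 times the baseline ASR, not that it exceeds 100% in absolute terms.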