THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

2026-06-01 • Computation and Language

Computation and Language • Artificial Intelligence

AI summary

The authors study jailbreak attacks that trick language models over multiple conversation turns by gradually escalating risky content. They argue that evaluating each turn in isolation misses how risk builds up across turns. They propose THRD, a training-free method that tracks the entire dialogue history using four modules to spot risk early, without retraining the model. Their experiments on two target models show THRD reduces attack success rates to between 0.2% and 4.0% while keeping degradation on MMLU and GSM8K within 1.5%. They also find that over 70% of multi-turn attacks cannot be detected until the second turn or later, which is why tracking conversation history matters.

large language models, jailbreak attacks, multi-turn conversation, risk assessment, model safety, trajectory-dependent, temporal risk accumulation, dialogue history, attack success rate, model utility

Authors

Zhiqing Ma, Zhonghao Xu, Dong Yu, Chen Kang, Changliang Li, Pengyuan Liu

Abstract

Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining, which often degrades model utility, or apply single-turn analysis independently at each turn, failing to capture how risk accumulates along interaction trajectories. We observe that safety behavior in multi-turn interaction is trajectory-dependent: dialogue history continuously reshapes the model's conditioning context, so evaluating each turn in isolation is insufficient. Motivated by this insight, we present THRD, the first training-free framework that explicitly models temporal risk accumulation for multi-turn jailbreak defense. THRD integrates four modules: a Turn-level Risk Assessor (TRA) for instantaneous risk estimation, a Historical Context Analyzer (HCA) for cross-turn intent escalation detection, a Response Evaluator (RE) for identifying facilitative outputs, and a Decision Module that combines these signals through a time-evolving scoring mechanism with attenuation-based modulation and trend-aware adjustment. Experiments against state-of-the-art multi-turn attacks, including tree-search-based and multi-agent collaborative methods, across two target models show that THRD reduces the attack success rate (ASR) to 0.2% to 4.0% while preserving model utility within 1.5% degradation on MMLU and GSM8K. Ablation studies confirm non-redundant module contributions and stable cross-architecture generalization. Analysis of first rejection triggers reveals that over 70% of multi-turn attacks require Turn 2 or later to detect, validating the necessity of explicit temporal aggregation.
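
The abstract does not give the Decision Module's scoring formula, so the following is only an illustrative sketch of how a time-evolving score with attenuation and a trend term could be combined. Everything in it is an assumption made for illustration and is not taken from the paper: the names `TurnSignals` and `RiskTracker`, the equal-weight averaging of the three signals, the exponential decay used as "attenuation", the rise-only trend bonus, and all parameter values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnSignals:
    """Hypothetical per-turn scores in [0, 1], one per upstream module."""
    turn_risk: float      # stands in for a TRA-style instantaneous estimate
    history_risk: float   # stands in for an HCA-style escalation estimate
    response_risk: float  # stands in for an RE-style facilitation estimate


@dataclass
class RiskTracker:
    """Illustrative accumulator; parameter values are arbitrary assumptions."""
    decay: float = 0.5         # attenuation applied to the previous score
    trend_weight: float = 0.5  # extra weight on turn-over-turn increases
    threshold: float = 0.5     # refuse once the score reaches this value
    score: float = 0.0
    prev_instant: Optional[float] = None

    def update(self, signals: TurnSignals) -> bool:
        """Fold one turn into the running score; return True to refuse."""
        # Assumed fusion rule: plain average of the three signals.
        instant = (
            signals.turn_risk + signals.history_risk + signals.response_risk
        ) / 3.0

        # Attenuation: older risk fades, the current turn is blended in.
        accumulated = self.decay * self.score + (1.0 - self.decay) * instant

        # Trend adjustment: only rising risk adds a bonus, falling risk does not.
        trend = 0.0
        if self.prev_instant is not None:
            trend = self.trend_weight * max(0.0, instant - self.prev_instant)

        self.score = min(1.0, max(0.0, accumulated + trend))
        self.prev_instant = instant
        return self.score >= self.threshold


if __name__ == "__main__":
    # A gradually escalating dialogue: the refusal fires only at the third turn.
    tracker = RiskTracker()
    for risk in (0.2, 0.5, 0.8):
        refuse = tracker.update(TurnSignals(risk, risk, risk))
        print(f"risk={risk:.1f} score={tracker.score:.3f} refuse={refuse}")
```

In this toy run no single early turn crosses the threshold on its own, but the decayed history plus the rising trend push the score over it on the third turn. That is the kind of trajectory-level behavior the abstract argues per-turn analysis cannot capture, though the actual mechanism in THRD may differ substantially.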
