SentGuard: Sentence-Level Streaming Guardrails for Large Language Models

2026-06-01
Computation and Language

AI summary

The authors present SentGuard, a method that checks large language model responses for harmful content one sentence at a time while the model is still generating text. Instead of waiting for the whole response or acting on each token, SentGuard groups streamed tokens into sentences and verifies each sentence before showing it to users, which gives earlier and more stable safety decisions. They also built a new benchmark, StreamSafe, with per-sentence annotations to train and test detection of unsafe content as it appears. In their experiments, SentGuard detects 90.5% of unsafe cases within two sentences with a streaming false-positive rate of 7.41%, outperforming existing baselines.

large language models, content moderation, streaming generation, sentence-level analysis, safety benchmarks, guardrails, token-level methods, false-positive rate, training objectives, harm detection
Authors
Jiaqi Yu, Xin Wang, Yixu Wang, Jie Li, Yan Teng, Xingjun Ma, Yingchun Wang
Abstract
Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall into two unsatisfactory extremes: response-level methods delay intervention until the full output is generated, whereas token-level methods act on incomplete semantics, often producing unstable decisions and excessive guard invocations. To address this challenge, we propose SentGuard, a sentence-level streaming guardrail that operates in parallel with generation. A lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks to the user, introducing a small offset that enables SentGuard to assess the current prefix while the target LLM decodes subsequent content. To support this, we construct StreamSafe, a benchmark with structured per-sentence annotations across 8 harm categories, capturing the evolution of safety risks across both reasoning and response segments. We further train SentGuard with a coarse-to-fine objective to detect unsafe intent as soon as it emerges at sentence boundaries. Experiments on 5 safety benchmarks show that SentGuard outperforms existing baselines, detecting 90.5% of unsafe cases within two sentences while maintaining a low streaming false-positive rate of 7.41%.
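
The waiting-buffer idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `stream_with_guard`, `is_unsafe`, and `toy_guard` are hypothetical, the sentence-boundary rule (a `.`, `!`, or `?` followed by whitespace) is an assumption, and the guard here is a keyword check standing in for the trained SentGuard model. The sketch also runs the guard sequentially, whereas the abstract describes the guard operating in parallel with decoding.

```python
import re
from typing import Callable, Iterable, Iterator

# Assumption: a sentence ends at ., !, or ? followed by whitespace.
# The abstract does not specify the actual boundary rule.
SENTENCE_END = re.compile(r"[.!?]\s")


def stream_with_guard(
    tokens: Iterable[str],
    is_unsafe: Callable[[str], bool],
) -> Iterator[str]:
    """Yield sentence chunks only after the guard has cleared the prefix."""
    buffer = ""  # tokens waiting for a sentence boundary
    prefix = ""  # all completed sentences shown to the guard so far
    for token in tokens:
        buffer += token
        # A single token may complete more than one sentence.
        while (match := SENTENCE_END.search(buffer)) is not None:
            sentence, buffer = buffer[: match.end()], buffer[match.end():]
            prefix += sentence
            if is_unsafe(prefix):
                return  # stop streaming; the flagged chunk is never released
            yield sentence
    # Flush trailing text that never reached a boundary, if it is cleared.
    if buffer and not is_unsafe(prefix + buffer):
        yield buffer


def toy_guard(prefix: str) -> bool:
    """Hypothetical stand-in for a learned guard: flags one keyword."""
    return "bomb" in prefix.lower()


demo_tokens = ["Hello", " there", ".", " Build", " a", " bomb", ".", " Done", "."]
print(list(stream_with_guard(demo_tokens, toy_guard)))  # ['Hello there. ']
```

In this sketch the user only ever sees chunks that the guard has already evaluated as part of the prefix, so the second sentence is withheld and the stream stops as soon as the guard flags it. The cost is a delay of one sentence before each chunk is released, which corresponds to the small offset the abstract mentions.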