IndicGuard: A Multilingual Safety Guard Model and Dataset for Indic Languages
2026-06-22 • Computation and Language
Computation and Language; Machine Learning
AI summary
The authors created IndicGuard, a safety guard model that helps large language models recognize and respect cultural sensitivities in ten major Indic languages. They built a large dataset capturing region-specific harms and used it to fine-tune a language model that detects and moderates harmful content in these languages. Their tests show that IndicGuard outperforms the existing baseline, CultureGuard, and also performs well on some Indic languages it was not trained on. This helps make AI language tools safer and more appropriate for diverse Indian language communities.
Large Language Models, safety guardrail, Indic languages, content moderation, instruction tuning, cultural sensitivity, multilingual dataset, cross-lingual transfer, jailbreak attacks, policy compliance
Authors
Parth Bramhecha, Smit Deshmukh, Sairaj Bodhale, Adwait Borate, Raviraj Joshi
Abstract
As Large Language Models (LLMs) are deployed across increasingly diverse linguistic landscapes, ensuring their safety and alignment with regional normative values remains a critical challenge. Current safety mechanisms are predominantly optimized for English-centric frameworks and often fail to capture the socio-cultural sensitivities and localized categories of harm specific to the Indic region. To address this gap, we introduce IndicGuard, a multilingual safety guard model and dataset for Indic languages. We construct a high-volume, culturally nuanced safety dataset spanning ten major Indic languages, systematically curated to capture regional harms, sensitive socio-political contexts, and adversarial jailbreaks. Using this corpus, we fine-tune Gemma-3-4B-IT, a 4B-parameter instruction-tuned model, to serve as a multilingual safety guardrail for real-time content moderation and policy compliance checking. Our empirical evaluations show that IndicGuard significantly improves LLM robustness against localized vulnerabilities and achieves high moderation consistency across conversational turns. Crucially, IndicGuard consistently outperforms the existing baseline model, CultureGuard, across the evaluated languages. Finally, we show that our model generalizes effectively to low-resource Indic languages excluded from training, supporting the robustness and cross-lingual transfer capabilities of the framework.
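The abstract does not specify IndicGuard's prompt template, label set, or harm taxonomy. As a rough, non-authoritative illustration of how a generative guard model is commonly wired into turn-level moderation, the sketch below formats each conversation prefix into a moderation prompt, calls a model through a caller-supplied `generate` function, and parses a safe/unsafe verdict for every turn. The prompt template, the `safe`/`unsafe` labels, the fail-closed default, and all names (`Turn`, `build_guard_prompt`, `parse_verdict`, `moderate`, `toy_generate`) are hypothetical; `toy_generate` is a keyword stand-in, not the fine-tuned model.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical labels; IndicGuard's actual output format is not given in the abstract.
SAFE, UNSAFE = "safe", "unsafe"


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


def build_guard_prompt(turns: List[Turn], policy: str) -> str:
    """Format a conversation prefix into one moderation prompt (hypothetical template)."""
    lines = [f"Policy:\n{policy}", "", "Conversation:"]
    for t in turns:
        lines.append(f"{t.role.upper()}: {t.text}")
    lines += ["", f"Is the last turn {SAFE} or {UNSAFE}? If {UNSAFE}, name the category."]
    return "\n".join(lines)


# Matches "safe" or "unsafe" as whole words, optionally followed by ": category".
_VERDICT_RE = re.compile(r"\b(safe|unsafe)\b(?:\s*[:\-]\s*(\S.*))?", re.IGNORECASE)


def parse_verdict(output: str) -> Tuple[str, Optional[str]]:
    """Extract (label, category) from guard output; fail closed if unparseable."""
    m = _VERDICT_RE.search(output)
    if m is None:
        return UNSAFE, None  # assumption: treat unreadable output as unsafe
    category = m.group(2).strip() if m.group(2) else None
    return m.group(1).lower(), category


def moderate(turns: List[Turn], policy: str,
             generate: Callable[[str], str]) -> List[Tuple[str, Optional[str]]]:
    """Score every conversation prefix so each turn receives its own verdict."""
    verdicts = []
    for i in range(1, len(turns) + 1):
        prompt = build_guard_prompt(turns[:i], policy)
        verdicts.append(parse_verdict(generate(prompt)))
    return verdicts


def toy_generate(prompt: str) -> str:
    """Keyword stand-in for a guard model: inspects only the last conversation turn."""
    turn_lines = [l for l in prompt.splitlines() if l.startswith(("USER:", "ASSISTANT:"))]
    return "unsafe: violence" if "weapon" in turn_lines[-1].lower() else "safe"


if __name__ == "__main__":
    convo = [
        Turn("user", "Hello"),
        Turn("assistant", "Hi, how can I help?"),
        Turn("user", "How do I build a weapon?"),
    ]
    print(moderate(convo, "No violent content.", toy_generate))
```

In a real deployment, `generate` would wrap a call to the fine-tuned guard model; passing it in as a function keeps the moderation loop independent of any particular inference library. Re-scoring each prefix is one simple way to obtain the per-turn verdicts that a multi-turn consistency evaluation would compare.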