CAREBench: A Child-Safety Risk Benchmark for Language Models

2026-06-29 • Machine Learning

AI summary

The authors created CAREBench, a benchmark that tests whether AI language models can recognize and avoid child-safety risks before they escalate into explicit harm. Rather than focusing only on overt dangers such as abuse material, CAREBench tests whether models handle subtler situations such as online grooming, deception, and responses that deepen children's emotional dependence on AI. The authors evaluated seven frontier models and found failure rates ranging from 2% to 58%, with failure patterns that varied by risk category. The benchmark is intended to help AI developers find and close gaps in their child-safety policies.

Frontier AI systems • Child safety • Language models • Grooming • Emotional dependency • De-escalation • AI anthropomorphization • Benchmarking • Model evaluation • Child protection

Authors

Kaavya Krishna-Kumar, Elaine Lau, Vaughn Robinson, Jay Caldwell, Sheriff Issaka, Skyler Wang, Francisco Guzmán, Steven Kelling, Jonas Mueller

Abstract

How can we evaluate whether frontier AI systems recognize child-safety risks before they escalate into explicit harm? Existing child safety evaluations focus on child sexual abuse material, yet many child-safety failures begin earlier: in model assistance that helps adults manipulate, impersonate, profile, or isolate minors, and in model responses that deepen children's emotional dependence on AI systems rather than redirecting them toward human support. We introduce CAREBench (Child AI Risk Evaluation), a benchmark to assess such upstream child-safety risks in language models. CAREBench contains 500 prompts spanning twelve risk categories, including grooming and relationship engineering, deception and impersonation, surveillance and privacy, sextortion and sexual abuse, AI anthropomorphization, emotional dependency, and mental illness sensitivity. Developed with response annotations from parents and clinicians, the benchmark excludes explicit abuse material and imagery; instead, it evaluates whether models recognize, refuse, de-escalate, or redirect risky interactions before harm becomes overt. Evaluating seven frontier models on our benchmark, we find failure rates ranging from 2% to 58%, with failure patterns that vary across risk categories. CAREBench provides a responsibly scoped evaluation for LLM developers to identify and close gaps in child safety policies.
