An Analysis Focused on Women's Safety: Can VAD Models Be Enhanced by a Multi-modal Dataset?

2026-05-25 • Computer Vision and Pattern Recognition

AI summary

The authors created a new video dataset called ExtrAnom to help detect crimes against women, such as stalking and harassment, that are often missed in existing datasets. Their dataset includes 1001 videos with different conditions like low-light and low-resolution to better mimic real-world surveillance. Each video has text descriptions made by both humans and AI to help improve video understanding models. They tested their dataset against popular ones and found current datasets don’t work well for spotting these women-focused crimes. The goal is to improve video anomaly detection specifically for women's safety.

Video Anomaly Detection, Women-centric Crime, Low-light Video, Surveillance Cameras, Textual Annotations, Multi-modal Dataset, Video-level Description, Chain Snatching, Stalking, Large Language Models

Authors

Sangeeta, Maddikuntla Sai Prajwal, Debi Prosad Dogra, Kamalakar Vijay Thakare, Hyungjoo Jung, Ig-Jae Kim, Heeseung Choi

Abstract

Women's safety and security are paramount in a modern society. Crimes against women occur in daylight as well as in low-light conditions. Such events are often captured by real-world surveillance cameras that operate at low resolutions. Despite substantial progress in computer vision (CV) research, video anomaly detection (VAD) focused on women's safety has not yet been adequately addressed. Existing video anomaly datasets contain well-lit, high-resolution, close-shot videos and fail to represent women-centric anomalies such as chain snatching, stalking, inappropriate touch, and other subtle forms of crime against women. To address these problems, we propose the ExtrAnom dataset, a new multi-modal benchmark containing 1001 videos with textual descriptions, 500 normal and 501 anomalous, with the anomalous videos classified into five types of women-centric crime. The dataset comprises low-light (8%), low-resolution (13%), and long-shot (15%) videos, along with daylight (64%) anomalous videos. It covers anomalous events such as stalking (3.9%), chain snatching (17.6%), kidnapping (7.3%), assassinations (2.3%), and harassment (18.9%), alongside normal videos (50%). Each video is supplemented with four textual annotations, one human-generated and three LLM-generated descriptions, enabling cross-modal and vision-language model (VLM)-based validation. The aim of creating a women-centric dataset is to enable accurate detection of women-centric anomaly patterns that can be observed visually. The dataset also helps VLMs generate accurate video-level descriptions. ExtrAnom has been benchmarked against popular unimodal and multi-modal VAD datasets (e.g., XD-Violence, UCF-Crime, and UCA) and state-of-the-art (SOTA) methods. Experiments reveal that the existing datasets are insufficient to train models for detecting women-centric anomalies.
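The class shares quoted in the abstract can be checked with a little arithmetic. The sketch below is illustrative only and is not part of the dataset's tooling: the names `class_share_pct` and `approx_counts` are hypothetical, and it assumes each percentage is a share of all 1001 videos. Because the published shares are rounded to one decimal place, the estimated counts may differ from the true per-class counts, which the abstract does not list, by a video or so.

```python
# Class shares as stated in the abstract, assumed to be percent of all videos.
TOTAL_VIDEOS = 1001

class_share_pct = {
    "stalking": 3.9,
    "chain snatching": 17.6,
    "kidnapping": 7.3,
    "assassinations": 2.3,
    "harassment": 18.9,
    "normal": 50.0,
}


def approx_counts(shares_pct, total):
    """Estimate per-class video counts by rounding share * total.

    These are estimates from rounded percentages, not published counts.
    """
    return {name: round(total * pct / 100) for name, pct in shares_pct.items()}


# The shares should add up to 100% (rounded to absorb float error).
share_sum = round(sum(class_share_pct.values()), 1)

counts = approx_counts(class_share_pct, TOTAL_VIDEOS)

# Sum of the estimated counts over the five anomalous classes.
anomalous_estimate = sum(n for name, n in counts.items() if name != "normal")

if __name__ == "__main__":
    print("share sum (%):", share_sum)
    for name, n in counts.items():
        print(f"{name}: ~{n} videos")
    print("estimated anomalous videos:", anomalous_estimate)
```

The estimated anomalous total can be compared with the 501 anomalous videos stated in the abstract. A gap of one video is consistent with rounding in the published percentages, not with an error in the dataset.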
