From Failure Taxonomy to Intervention: A Diagnostic Methodology for Industry-Scale AVLM in Video and Live-Streaming Platform Moderation

2026-06-29 • Machine Learning


AI summary

The authors explain that big video and live-stream platforms need special models for moderation that fit their unique rules and types of content, which public models often can't handle well. They point out that when these models fail, figuring out the exact problem is hard because different issues can look similar. To help with this, the authors created a step-by-step system to diagnose problems by categorizing them and linking each to specific fixes. They applied this system to build a large model that handles complex video and audio content from many regions around the world.

multimodal models, video moderation, model failure diagnosis, audio-visual-language models, model alignment, content moderation, foundation models, live streaming, model interventions, platform-specific adaptation

Authors

Shuchang Ye, Jinqiang Yu, Zhujun Xiao, Yajing Kong, Yist Y. Lin, Yang Ma, Jiaxi Liu, Xiaolei Xu, Zheng Yu

Abstract

Industry-scale video and live-streaming moderation imposes requirements that are difficult to satisfy with generic pretrained public models or external APIs, including adaptation to platform-specific data distributions, policy-specific objectives, and product-level safety constraints. As a result, platforms must undertake internal model development, naturally turning to shared public research for guidance. However, existing multimodal foundation-model studies primarily report architectures, training recipes, data scaling strategies, and benchmark results, and provide less systematic guidance on how failures should be localized and translated into targeted model-development interventions. Targeted interventions are essential because deployment failures are rarely self-explanatory: similar failures can originate from different causes. Without targeted interventions, improvement reduces to heuristic trial-and-error, where benchmark improvements are weakly attributable and failures are difficult to trace to their underlying causes. To address this gap, we present a diagnostic methodology for industry-scale Audio-Visual-Language Model (AVLM) development. The methodology maps model failures into a taxonomy of observable failure signatures and links each class of failure to an intervention space. We instantiate this methodology across the development and alignment lifecycle of an AVLM foundation model for a large-scale video and live-streaming platform. The resulting system supports over 100 regions and is designed for noisy, ambiguous, and highly diverse content drawn from global platform traffic.
