ContextShift: A Controlled Benchmark for Context Dependence in Object Detection

2026-06-08
Computer Vision and Pattern Recognition

AI summary

The authors created a new benchmark called ContextShift to test how well object detectors work when the background or context around objects changes while the objects themselves stay the same. Across the detector architectures they tested, context shifts cause detectors to miss more objects (false negatives) without raising more false alarms (false positives), a problem that aggregate accuracy measures like AP can hide. The losses appear to come less from lower confidence scores than from the detector forming fewer valid detection candidates. Performance also varies non-monotonically with how statistically compatible the object and context are (measured by NPMI): it peaks at intermediate compatibility and drops toward both extremes. Finally, training detectors with context-aware augmentation helps them handle such variations better.

object detection, contextual variation, false negatives, average precision (AP), COCO dataset, false positives, normalized pointwise mutual information (NPMI), data augmentation, robustness
Authors
Dan Zlotnikov, Alex Lazarovich, Ohad Ben-Shahar
Abstract
Modern object detectors achieve strong performance on standard benchmarks, yet their robustness to contextual variation remains insufficiently understood. Prior evaluations largely rely on aggregate metrics such as AP under uncontrolled distribution shifts, which can obscure how performance degrades under context change. We introduce ContextShift, a controlled benchmark that systematically manipulates object-context relationships while preserving object appearance. Built on COCO 2017, it isolates context as an independent variable through geometric transformations and both synthetic and natural background substitutions, including a continuous compatibility axis based on normalized pointwise mutual information (NPMI). Across diverse detector architectures, we observe a consistent degradation pattern: false negatives increase by up to 227% and prediction volume decreases by up to 44%, while false positives remain stable or decline. This suppression behavior is not captured by aggregate metrics such as AP, which can mask substantial recall loss and changes in prediction dynamics. Further analysis suggests that degradation is driven less by reduced confidence than by reduced formation of valid detection candidates. Moreover, performance along the statistical compatibility axis is non-monotonic, peaking at intermediate NPMI and degrading toward both extremes, indicating that statistical co-occurrence does not correlate linearly with effective visual context. Finally, we show that context-aware augmentation improves robustness: every augmented variant outperforms the dataset-only baseline on both original and manipulated test images, partially recovering performance lost to prediction-suppression failures by exposing models to object-context decoupling during training.
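
For readers unfamiliar with the compatibility axis, the following is a minimal sketch of the standard NPMI definition, NPMI(x, y) = PMI(x, y) / (-log p(x, y)) with PMI(x, y) = log(p(x, y) / (p(x) p(y))). The abstract does not say how the probabilities are estimated, so the image-level co-occurrence counts used here, the function name `npmi`, and its parameters are assumptions made for illustration only, not the benchmark's actual procedure.

```python
import math


def npmi(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
    """Normalized PMI of two events from co-occurrence counts.

    Hypothetical setup (an assumption, not taken from the paper):
      n_xy    -- images containing both object category x and context y
      n_x     -- images containing object category x
      n_y     -- images containing context y
      n_total -- all images counted

    The result lies in [-1, 1]: -1 when x and y never co-occur,
    0 when they are statistically independent, +1 when each
    occurs only together with the other.
    """
    if n_total <= 0 or n_x <= 0 or n_y <= 0:
        raise ValueError("n_x, n_y and n_total must be positive")
    if n_xy < 0 or n_xy > min(n_x, n_y):
        raise ValueError("n_xy must lie between 0 and min(n_x, n_y)")
    if n_xy == 0:
        return -1.0  # limiting value as p(x, y) -> 0
    p_xy = n_xy / n_total
    if p_xy == 1.0:
        return 1.0  # -log p(x, y) would be 0; both events occur in every image
    p_x = n_x / n_total
    p_y = n_y / n_total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)
```

With made-up toy counts, `npmi(10, 50, 20, 100)` describes an object and a context that co-occur exactly as often as independence predicts, so the score is approximately 0. Scores near the two ends of the range mark strongly incompatible and strongly compatible pairs, the extremes where the abstract reports degraded performance.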