EIVE: End-to-End Instance-Specific Visual Explanations for Detection Transformers

2026-06-01 • Computer Vision and Pattern Recognition

AI summary

The authors address the problem of explaining how an object detector identifies each object in an image, which is difficult because several objects can appear at once. Existing methods are slow because they compute the model's gradients or run the model repeatedly on perturbed inputs. The authors' framework, EIVE, instead reads the decoder cross-attention of DETR-like detectors directly to show which image regions correspond to each detected object. It fuses attention signals across decoder layers to produce compact, instance-specific saliency maps without gradient computation or input perturbation. The authors also introduce a joint training strategy that constrains attention spatially, making the explanations more concentrated and improving detection performance. Experiments on MS COCO 2017, ExDark, and Cityscapes show explanation quality comparable to or better than existing post-hoc methods at substantially higher efficiency.

object detection, visual explainability, Detection Transformer (DETR), cross-attention, saliency maps, post-hoc explanation, feature attribution, model interpretability, cross-layer fusion, joint training

Authors

Jianlin Xiang, Yanshan Li, Linhui Dai

Abstract

Visual explainability for object detection remains challenging due to the multi-instance nature of detection. Existing approaches predominantly adopt post-hoc paradigms, such as gradient-based or perturbation-based explanation methods, to interpret pretrained detectors. However, these methods require additional gradient computation or repeated model inference, resulting in limited efficiency. To address this issue, we propose an End-to-end Instance-specific Visual Explanation framework (EIVE) that directly generates instance-level saliency maps following the forward pass of Detection Transformer (DETR)-like models. Specifically, we reformulate the cross-attention mechanism in the decoder as an instance-level feature attribution pathway, so that the cross-attention of each object query corresponds to the visual attribution of its predicted instance. Based on this formulation, we design a cross-layer hybrid consensus fusion (CLHCF) module to aggregate cross-attention signals across decoder layers, producing stable and compact explanations. The explanation process of EIVE requires neither gradient computation nor input perturbation, yielding high computational efficiency, and applies to single- and multi-scale DETR-like object detectors. Finally, we present an attention-aware joint training strategy (AAJTS) as a training-oriented application, which imposes spatial constraints on cross-attention patterns to encourage stable and concentrated attribution representations, thereby improving both interpretability and detection performance. Experiments on MS COCO 2017, ExDark, and Cityscapes demonstrate that EIVE produces high-quality instance-level saliency maps and achieves performance comparable to, or better than, state-of-the-art post-hoc methods across standard metrics, while substantially improving explanation efficiency. Code is available at https://github.com/xjlDestiny/EIVE.git.
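The core idea in the abstract, reading one object query's decoder cross-attention as the saliency map of its predicted instance and combining it across decoder layers, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `instance_saliency`, the nested-list layout `cross_attn[layer][query][position]`, and the min-max normalization are hypothetical, and a plain average across layers stands in for the CLHCF module, whose actual fusion rule is not described here.

```python
from typing import List


def instance_saliency(
    cross_attn: List[List[List[float]]],
    query_idx: int,
    height: int,
    width: int,
) -> List[List[float]]:
    """Build one saliency map for one object query.

    cross_attn[l][q][p] is assumed to hold the attention weight that
    query q places on flattened spatial position p at decoder layer l.
    """
    num_positions = height * width

    # Collect this query's attention row from every decoder layer.
    per_layer = [layer[query_idx] for layer in cross_attn]
    for weights in per_layer:
        if len(weights) != num_positions:
            raise ValueError("attention row does not match height * width")

    # Stand-in fusion (assumption): average across layers.
    # The paper's CLHCF module is NOT reproduced here.
    fused = [sum(column) / len(per_layer) for column in zip(*per_layer)]

    # Min-max normalize to [0, 1]; guard against a constant map.
    lowest, highest = min(fused), max(fused)
    scale = (highest - lowest) or 1.0
    normalized = [(value - lowest) / scale for value in fused]

    # Reshape the flat vector back into a height x width grid.
    return [normalized[r * width:(r + 1) * width] for r in range(height)]


# Toy input: 2 decoder layers, 2 object queries, a 2 x 2 feature map.
toy_attn = [
    [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]],
    [[0.5, 0.3, 0.1, 0.1], [0.1, 0.1, 0.3, 0.5]],
]
saliency_query_0 = instance_saliency(toy_attn, query_idx=0, height=2, width=2)
saliency_query_1 = instance_saliency(toy_attn, query_idx=1, height=2, width=2)
```

In the toy input, query 0 peaks at the top-left cell and query 1 at the bottom-right cell, so each query yields its own map from a single set of forward-pass attention weights, with no gradients and no repeated inference.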
