HI-MoE: Hierarchical Instance-Conditioned Mixture-of-Experts for Object Detection

2026-04-06

Machine Learning
AI summary

The authors propose a new way to improve object detection in images using a neural network architecture called Hierarchical Instance-Conditioned Mixture-of-Experts (HI-MoE). Instead of deciding which parts of the model to use based on the whole image or on small patches, their method first picks a group of experts based on the whole scene, then lets each object candidate select a few experts within that group. This two-step approach helps the model adapt computation to individual objects, especially small ones. The authors tested their method on common benchmarks and saw improvements over simpler routing approaches.

Mixture-of-Experts · Conditional Computation · Object Detection · DETR · Routing · Instance Query · Scene Router · Expert Specialization · COCO Dataset · LVIS Dataset
Authors
Vadim Vashkelis, Natalia Trukhina
Abstract
Mixture-of-Experts (MoE) architectures enable conditional computation by activating only a subset of model parameters for each input. Although sparse routing has been highly effective in language models and has also shown promise in vision, most vision MoE methods operate at the image or patch level. This granularity is poorly aligned with object detection, where the fundamental unit of reasoning is an object query corresponding to a candidate instance. We propose Hierarchical Instance-Conditioned Mixture-of-Experts (HI-MoE), a DETR-style detection architecture that performs routing in two stages: a lightweight scene router first selects a scene-consistent expert subset, and an instance router then assigns each object query to a small number of experts within that subset. This design aims to preserve sparse computation while better matching the heterogeneous, instance-centric structure of detection. In the current draft, experiments are concentrated on COCO with preliminary specialization analysis on LVIS. Under these settings, HI-MoE improves over a dense DINO baseline and over simpler token-level or instance-only routing variants, with especially strong gains on small objects. We also provide an initial visualization of expert specialization patterns. We present the method, ablations, and current limitations in a form intended to support further experimental validation.
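The two-stage routing described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the linear routers, and the expert counts (`n_scene_experts`, `n_inst_experts`) are hypothetical choices. A scene router first scores all experts from a pooled scene feature and keeps a top-K subset; an instance router then scores only that subset for each object query and keeps a few experts per query.

```python
import numpy as np

def hierarchical_route(query_feats, scene_feat, W_scene, W_inst,
                       n_scene_experts=4, n_inst_experts=2):
    """Sketch of two-stage (scene -> instance) top-k expert routing.

    query_feats: (Q, D) object-query features
    scene_feat:  (D,)   pooled whole-scene feature
    W_scene, W_inst: (D, E) linear router weights (hypothetical form)
    Returns per-query global expert ids (Q, k) and softmax gate weights.
    """
    # Stage 1: scene-level logits over all E experts -> scene-consistent subset
    scene_logits = scene_feat @ W_scene                       # (E,)
    subset = np.argsort(scene_logits)[-n_scene_experts:]      # (K_s,)

    # Stage 2: per-query logits, restricted to the scene subset
    inst_logits = query_feats @ W_inst                        # (Q, E)
    sub_logits = inst_logits[:, subset]                       # (Q, K_s)
    top = np.argsort(sub_logits, axis=1)[:, -n_inst_experts:] # (Q, k)
    chosen = subset[top]                                      # global expert ids

    # Softmax gate weights over each query's chosen experts
    g = np.take_along_axis(sub_logits, top, axis=1)
    w = np.exp(g - g.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    return chosen, w

rng = np.random.default_rng(0)
E, D, Q = 8, 16, 5  # experts, feature dim, object queries (toy sizes)
chosen, w = hierarchical_route(
    rng.normal(size=(Q, D)), rng.normal(size=D),
    rng.normal(size=(D, E)), rng.normal(size=(D, E)))
```

Because every query's experts are drawn from the same scene-selected subset, at most `n_scene_experts` expert networks are active per image regardless of the number of queries, which is what keeps the computation sparse.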