PHOEBI: An Open-World Benchmark for Bacterial Identification in Phase-Contrast Microscopy

2026-06-22 | Computer Vision and Pattern Recognition

AI summary

The authors created a large dataset called PHOEBI, containing phase-contrast microscopy images of mixtures of six rod-shaped bacterial species. They tested how well computer models could identify the species in bacterial mixtures that the models had never seen before. They found that common methods struggle with these unseen mixtures, not because the image features are poor, but because the component that turns those features into predictions is limited. The authors then developed new lightweight methods that detect individual species by using the same image features differently, and these perform better on unseen mixtures. This work helps improve the identification of bacteria in real-world samples where many species appear together.

Optical microscopy, Phase-contrast microscopy, Polymicrobial samples, Species identification, Computer vision, Open-world recognition, F1 score, Bacterial mixtures, Feature representation, Anchor-based decoders
Authors
Aaditya Baranwal, Md Jahid Hasan, Shruti Vyas
Abstract
Optical microscopy enables rapid, label-free imaging of live bacteria and is the standard instrument for species identification across clinical, environmental, and industrial microbiology. Yet field samples are routinely polymicrobial and may contain organisms never seen during system training, and no computer-vision benchmark tests multi-label species identification from phase-contrast microscopy (PCM) of such mixtures. We introduce the Phase-contrast Optical bEnchmark for Bacterial Identification ($\textbf{PHOEBI}$), a wet-lab-prepared dataset of $120{,}000$ PCM images covering $40$ combinations of six rod-shaped species, paired with a leave-combinations-out (LCO) evaluation protocol. LCO holds out entire species combinations, mirroring the practical scenario in which a model trained on catalogued mixtures must generalise to unseen ones. On LCO, every gradient-trained per-image aggregator we test loses $0.39$ to $0.57$ F1 between the in-distribution and held-out splits, a systematic open-world recognition failure that lies in the aggregator, not in the visual representation. A linear probe over the same features, run across thirteen different encoders, shows a spread of only about six percentage points of F1 across general-purpose and biomedical pretraining objectives, confirming that the representation is sound. We propose three lightweight $\textit{anchor-based}$ decoders that capture per-species presence geometrically over a shared frozen tile-feature pool and score $\textit{higher}$ on held-out combinations than on in-distribution validation.
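The leave-combinations-out idea can be illustrated with a minimal Python sketch. It is not the authors' protocol: the species names, the choice of which 40 combinations to include, the number of held-out combinations (8), and the function names are all hypothetical placeholders, since the abstract does not specify them. The sketch only shows the defining property of such a split, namely that a held-out combination never appears in training, together with the multi-hot targets that multi-label identification would use.

```python
import random
from itertools import combinations

# Hypothetical placeholder names; the abstract does not list the six species.
SPECIES = ["sp_a", "sp_b", "sp_c", "sp_d", "sp_e", "sp_f"]


def enumerate_combinations(species):
    """Return every non-empty subset of `species` as a frozenset."""
    return [
        frozenset(subset)
        for size in range(1, len(species) + 1)
        for subset in combinations(species, size)
    ]


def leave_combinations_out(combos, n_held_out, seed=0):
    """Split combinations so that held-out ones never appear in training."""
    ordered = sorted(combos, key=sorted)  # deterministic order before shuffling
    random.Random(seed).shuffle(ordered)
    held_out = set(ordered[:n_held_out])
    train = set(ordered[n_held_out:])
    return train, held_out


def multi_hot(combo, species):
    """Encode a combination as a multi-label target vector."""
    return [1 if name in combo else 0 for name in species]


# Assumption: 40 of the 63 possible subsets, picked at random for illustration.
all_combos = enumerate_combinations(SPECIES)
benchmark_combos = random.Random(0).sample(all_combos, 40)
train_combos, held_out_combos = leave_combinations_out(benchmark_combos, n_held_out=8)
```

Splitting at the level of combinations, not individual images, is what distinguishes this from an ordinary random split: every image of a held-out mixture is excluded from training, so a model cannot score well by memorising which label sets co-occur.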