Bridging the Sim-to-Real Gap in Semiconductor Visual Program Synthesis via Input Binarization

2026-06-01 • Artificial Intelligence

AI summary

The authors address a challenge in semiconductor inspection: obtaining enough real training images is costly. They build a system in which a Vision-Language Model (VLM) turns images of circuits into editable Domain-Specific Language (DSL) code that describes the circuit geometry, so that training data can be generated with exact control over its parameters. Because the model is trained only on synthetic, DSL-rendered images, real Scanning Electron Microscope (SEM) images look different to it. To close this gap, the authors binarize the real images into black-and-white form, which strips SEM-specific texture and noise and lets the model focus on shapes. On the MIIC dataset, this raises the mean Dice coefficient from 0.4393 with raw inputs to 0.5256 with binarized inputs.

Vision-Language Model, Domain-Specific Language, Semiconductor Inspection, Generative Models, Data Augmentation, Scanning Electron Microscope, Sim-to-Real Gap, Binarization, Circuit Geometry, Dice Coefficient

Authors

Yusuke Ohtsubo, Kota Dohi, Koichiro Yawata, Koki Takeshita, Tatsuya Sasaki

Abstract

Precise parametric control over circuit geometry is essential for semiconductor inspection, yet obtaining sufficient real training data remains costly. Although generative models such as diffusion models and Generative Adversarial Networks (GANs) can augment training data, they cannot guarantee the nanometer-scale geometric accuracy required for metrology tasks. We propose a visual program synthesis framework in which a Vision-Language Model (VLM) converts inspection images into editable Domain-Specific Language (DSL) code describing circuit geometries, enabling controlled generation of training data with exact parameter manipulation. Because the VLM is trained solely on synthetic DSL-rendered data, a domain gap arises when processing real Scanning Electron Microscope (SEM) images. We bridge this gap with an input binarization strategy that strips SEM-specific texture and noise, letting the model focus on geometric structure. On the MIIC dataset, binarized inputs improve the mean Dice coefficient from 0.4393 to 0.5256 over the raw-input baseline, demonstrating that simple texture abstraction substantially mitigates the sim-to-real gap.
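To make the two operations named in the abstract concrete, the following is a minimal sketch of input binarization and of the Dice coefficient used for evaluation. It rests on assumptions not stated in the abstract: the thresholding method shown is Otsu's method (the abstract does not say which binarization procedure the authors use), the input is an 8-bit grayscale image, and the function names `otsu_threshold`, `binarize`, and `dice` are hypothetical. It should be read as an illustration, not as the authors' implementation.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the Otsu threshold of an 8-bit grayscale image.

    Assumption: Otsu's method stands in for whatever binarization
    the paper actually applies.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # probability of class "<= level"
    mu = np.cumsum(prob * np.arange(256))    # cumulative mean up to each level
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / denom  # between-class variance
    sigma_b[denom == 0] = 0.0                # levels where one class is empty
    return int(np.argmax(sigma_b))


def binarize(gray: np.ndarray) -> np.ndarray:
    """Map a grayscale image to a boolean mask (True = brighter than threshold)."""
    return gray > otsu_threshold(gray)


def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) between two binary masks."""
    a = a.astype(bool)
    b = b.astype(bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0                           # both masks empty: treat as identical
    return float(2.0 * np.logical_and(a, b).sum() / total)


if __name__ == "__main__":
    # Synthetic stand-in for an SEM image: a bright rectangle on a dark,
    # noisy background. This is illustrative data, not from the MIIC dataset.
    rng = np.random.default_rng(0)
    truth = np.zeros((64, 64), dtype=bool)
    truth[16:48, 20:44] = True
    noisy = np.where(truth, 200.0, 40.0) + rng.normal(0.0, 10.0, truth.shape)
    gray = np.clip(noisy, 0, 255).astype(np.uint8)
    print("Dice vs. ground truth:", dice(binarize(gray), truth))
```

For scale, the change reported in the abstract, from 0.4393 to 0.5256, is an absolute gain of 0.0863 in mean Dice, or roughly 19.6% relative to the raw-input baseline.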
