Towards LLM-Based Analysis of Virtualization-Obfuscated Code through Automated Data Generation
2026-05-11 • Cryptography and Security
AI summary
The authors examine software that has been intentionally obfuscated with virtualization techniques, which yields very large, structurally complex programs that are difficult for large language models to analyze. Rather than trying to recover the entire program's semantics at once, they decompose it into smaller parts that a model can handle and focus on each part's structural role within the program. They built a static analysis tool that automatically labels these parts by their structure, making it practical to generate large datasets for training. Initial tests show the method performs well on real-world obfuscated software.
virtualization-based obfuscation, large language models, static analysis, binary analysis, semantic analysis, structural roles, dataset generation, software obfuscation, program decomposition, automatic labeling
Authors
Sangjun An, Hyeyeon Park, Yejin Son, Seoksu Lee, Eun-Sun Cho
Abstract
Virtualization-based obfuscation produces extremely large and structurally complex binaries, posing challenges for LLM-based analysis due to input size limits and the need for large-scale labeled data. We address this by focusing on structural rather than full semantic analysis. Obfuscated binaries are decomposed into the largest semantically coherent units that fit within LLM constraints and are labeled according to their structural roles. We implement a static analysis framework to automate labeling and enable large-scale dataset generation. Our prototype shows strong performance on real-world virtualization obfuscators.
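The decomposition step described above can be illustrated with a minimal sketch: greedily merging adjacent code units into the largest groups that stay within an LLM context budget. The function names, the list-of-strings representation of units, and the whitespace-based token heuristic are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of decomposing a binary's code units into the
# largest groups that fit an LLM input budget. Assumed names and the
# token heuristic are for illustration only.

def estimate_tokens(code: str) -> int:
    # Crude proxy: roughly one token per whitespace-separated item.
    return len(code.split())

def chunk_units(units: list[str], budget: int) -> list[list[str]]:
    """Greedily merge adjacent units while the total stays under budget."""
    chunks: list[list[str]] = []
    current: list[str] = []
    used = 0
    for unit in units:
        cost = estimate_tokens(unit)
        if current and used + cost > budget:
            # Current group is as large as the budget allows; start a new one.
            chunks.append(current)
            current, used = [], 0
        current.append(unit)
        used += cost
    if current:
        chunks.append(current)
    return chunks

# Example: four small units merged under a budget of 8 "tokens".
units = ["push rbp ; mov rbp, rsp", "mov eax, [rbp-4]", "add eax, 1", "pop rbp ; ret"]
print(chunk_units(units, budget=8))
```

Each resulting group would then be labeled with its structural role (e.g. dispatcher, handler) rather than analyzed for full semantics, which is what keeps the units within model limits.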