Machine Learning Transferability for Malware Detection

2026-03-27 | Cryptography and Security

Cryptography and Security, Artificial Intelligence, Machine Learning
AI summary

The authors study how data preparation affects machine learning models that detect malware in Portable Executable (PE) files, the executable file format used by Windows programs. They focus on making features compatible across different datasets so that models work well on data from various sources. To test this, they trained models on different combinations of datasets and checked their performance on several other malware datasets. Their work mainly examines how the choice of preprocessing approach affects a model's ability to detect malware reliably.

malware detection, machine learning, data preprocessing, Portable Executable (PE), feature compatibility, dataset generalization, distribution shifts, EMBER dataset, BODMAS, ERMDS
Authors
César Vieira, João Vitorino, Eva Maia, Isabel Praça
Abstract
Malware continues to be a predominant operational risk for organizations, especially when obfuscation techniques are used to evade detection. Despite ongoing efforts to develop Machine Learning (ML) detection approaches, public datasets still lack feature compatibility. This limits generalization under distribution shifts, as well as transferability to different datasets. This study evaluates the suitability of different data preprocessing approaches for the detection of malware in Portable Executable (PE) files with ML models. The preprocessing pipeline unifies the datasets under the EMBERv2 feature set (2,381 dimensions) and trains paired models under two training setups: EMBER + BODMAS and EMBER + BODMAS + ERMDS. For evaluation, the models from both setups are tested against TRITIUM, INFERNO and SOREL-20M. ERMDS is also used as a test set for the EMBER + BODMAS setup.
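The train/test layout described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' pipeline: the setup labels, the function names `evaluation_sets` and `combine`, and the 0/1 label encoding are assumptions made for the example. Only the dataset names and the 2,381-dimensional EMBERv2 feature width come from the abstract.

```python
from typing import Dict, List, Sequence, Tuple

EMBER_V2_DIM = 2381  # EMBERv2 feature dimensionality stated in the abstract

# Hypothetical labels for the two training setups described in the abstract.
TRAIN_SETUPS: Dict[str, List[str]] = {
    "EMBER+BODMAS": ["EMBER", "BODMAS"],
    "EMBER+BODMAS+ERMDS": ["EMBER", "BODMAS", "ERMDS"],
}

# Test sets applied to both setups.
SHARED_TEST_SETS: List[str] = ["TRITIUM", "INFERNO", "SOREL-20M"]

# (feature vector, label); the 0 = benign, 1 = malicious encoding is an assumption.
Sample = Tuple[Sequence[float], int]


def evaluation_sets(setup: str) -> List[str]:
    """List the test sets for a setup; ERMDS is a test set only if it was not trained on."""
    held_out = list(SHARED_TEST_SETS)
    if "ERMDS" not in TRAIN_SETUPS[setup]:
        held_out.append("ERMDS")
    return held_out


def combine(datasets: Dict[str, List[Sample]], setup: str) -> List[Sample]:
    """Concatenate the training datasets of a setup, rejecting vectors of the wrong width."""
    combined: List[Sample] = []
    for name in TRAIN_SETUPS[setup]:
        for features, label in datasets[name]:
            if len(features) != EMBER_V2_DIM:
                raise ValueError(
                    f"{name}: expected {EMBER_V2_DIM} features, got {len(features)}"
                )
            combined.append((features, label))
    return combined


if __name__ == "__main__":
    for setup in TRAIN_SETUPS:
        print(setup, "->", evaluation_sets(setup))
```

The width check in `combine` stands in for the feature-compatibility requirement: datasets can only be merged for training once every sample is expressed in the same feature space.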