Building an Adversarial Malware Dataset by Family and Type: Generation, Evasion, and Poisoning Evaluation

2026-05-25 • Cryptography and Security

Cryptography and Security, Machine Learning

AI summary

The authors built a large collection of adversarial malware files designed to evade EMBER, a machine-learning malware detector. They generated 44,347 samples labelled by malware family and 33,596 labelled by type, which evade detection 98.35 % and 92.20 % of the time, respectively. They also showed that adding mislabelled adversarial samples amounting to only 0.5 % of the training data raises the evasion rate against the re-trained detector from 26.1 % to 92.8 % on the family-labelled dataset. The dataset is publicly released so that others can study how to make malware detectors more robust against such attacks.

adversarial malware, PE files, EMBER classifier, malware evasion, data poisoning, machine learning, malware detection, VirusTotal, training data, classification robustness

Authors

David Košťál, Martin Jureček

Abstract

We present a dataset of adversarial malware samples derived from the public RawMal-TF collection of real-world malware binaries. Using a suite of adversarial malware generators, we construct two sets of adversarial PE files: 44,347 family-labelled samples and 33,596 type-labelled samples, achieving evasion rates of 98.35 % and 92.20 % against the EMBER classifier, respectively. Each adversarial binary is accompanied by detailed metadata, including EMBER scores and VirusTotal classifications. We further demonstrate the susceptibility of malware classification pipelines to data poisoning attacks through a series of training experiments. Injecting fully mislabelled adversarial samples representing only 0.5 % of the training data in the family-labelled dataset increases the evasion rate against the re-trained classifier from 26.1 % to 92.8 %. The dataset is publicly released to facilitate future research on adversarial malware, poisoning attacks, and the robustness of machine-learning-based malware detection systems.
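The two quantities the abstract reports, the evasion rate and the poisoning fraction, can be made concrete with a minimal Python sketch. It assumes that a sample counts as evasive when the detector score falls below a decision threshold, and that the 0.5 % figure is measured against the final (clean plus poisoned) training set. The threshold value of 0.5, the function names, and the example scores are all hypothetical and are not taken from the paper.

```python
import math


def evasion_rate(scores, threshold=0.5):
    """Fraction of adversarial samples scored below the detection threshold.

    `threshold` is a hypothetical cut-off; the abstract does not state
    the value used with EMBER.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    evaded = sum(1 for s in scores if s < threshold)
    return evaded / len(scores)


def poison_budget(clean_train_size, fraction=0.005):
    """Number of poisoned samples p such that p / (n + p) == fraction.

    Assumes the fraction refers to the final training set, which is one
    possible reading of "0.5 % of the training data".
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    # Solve p / (n + p) = f for p, then round up to a whole sample.
    return math.ceil(fraction * clean_train_size / (1.0 - fraction))


# Made-up detector scores for four adversarial samples.
scores = [0.10, 0.92, 0.31, 0.05]
print(evasion_rate(scores))    # three of the four scores are below 0.5
print(poison_budget(100_000))  # poisoned samples needed for a 0.5 % share
```

With these made-up inputs, three of four samples evade the detector, and a clean training set of 100,000 samples would need roughly 500 poisoned samples to reach a 0.5 % share.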
