MASCOT-Android: A Curated Dataset and Automated Collection Pipeline for Android Malware Source Code Specimens
2026-06-15 • Cryptography and Security
Cryptography and Security, Artificial Intelligence
AI summary
The authors created a new dataset called MASCOT-Android that has source code from Android malware, which is helpful because source code shows what attackers intended more clearly than other types of code. Since source code for malware is rare and hard to find, they made an automated way to gather it by looking at documentation files in GitHub repositories. They trained a model that reads README files and can tell if a repo has malware code with about 96% accuracy and a low false positive rate. This method lets users change how strict the detection is to find a good balance between catching malware and avoiding mistakes.
Android malware, source code, GitHub repositories, README files, TF-IDF, LinearSVC, malware detection, machine learning classification, false positive rate, dataset curation
Authors
Bojing Li, Duo Zhong, Prajna Bhandary, Raguvir S, Charles Maxa, Robert J Joyce, Charles Nicholas
Abstract
Compared with binaries and decompiled code, malware source code more directly reflects the attackers' original intent. However, the scarcity of source code and the high cost of manual review make such datasets difficult to build and maintain. We propose MASCOT-Android, a curated dataset of Android malware source code, together with an automated collection framework for scalable malware source code discovery on GitHub. A key finding of our work is that repository-level documentation alone provides a strong signal for malware source code collection. Our model extracts character-level TF-IDF features from 8,772 malware and 25,747 benign README documents and trains a LinearSVC classifier to distinguish malware repositories from benign ones. This README-only model achieves an accuracy of 96.28% and a false positive rate (FPR) of 1.06% in local evaluation. In addition, the model outputs confidence scores, allowing users to adjust the decision threshold to balance FPR against coverage, which is practical for real-world malware source code collection.
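The threshold trade-off mentioned at the end of the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the scores and labels are invented, the helper name `fpr_and_coverage` is hypothetical, and "coverage" is taken to mean recall on the malware class, which the abstract does not define. In scikit-learn, scores of this kind could be obtained from `LinearSVC.decision_function`, though the authors' exact setup is not given here.

```python
from typing import List, Tuple


def fpr_and_coverage(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[float, float]:
    """Return (false positive rate, coverage) when flagging scores >= threshold.

    labels: 1 = malware repository, 0 = benign repository.
    Coverage is assumed here to mean recall on the malware class.
    """
    # Benign repositories that would be flagged as malware.
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    # Malware repositories that would be flagged correctly.
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    negatives = labels.count(0)
    positives = labels.count(1)
    fpr = fp / negatives if negatives else 0.0
    coverage = tp / positives if positives else 0.0
    return fpr, coverage


# Invented confidence scores (signed distances to the decision boundary).
scores = [-1.8, -0.9, -0.2, 0.3, 0.1, 0.7, 1.4, 2.2]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Raising the threshold lowers the FPR but also flags fewer malware repositories.
for threshold in (0.0, 0.5, 1.0):
    fpr, coverage = fpr_and_coverage(scores, labels, threshold)
    print(f"threshold={threshold:+.1f}  FPR={fpr:.2f}  coverage={coverage:.2f}")
```

On this toy data, moving the threshold from 0.0 to 0.5 removes the single false positive at the cost of missing one malware repository. A collection pipeline could pick a threshold this way, according to how much manual review it can afford.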