Uncovering Similar but Different Packages in PyPI and Potential Security Threats

2026-06-29 | Software Engineering

AI summary

The authors studied about one-third of PyPI (200K packages), the platform for sharing Python packages, and found many packages that copy most of their code from existing ones. These copies confuse users, carry known vulnerabilities forward unnoticed, and can serve as a starting point for malicious software. The authors examined replication of popular, vulnerable, and malicious packages and found replicated packages in all three groups, including ones that hide security risks and ones used to spread malware. Their work highlights the security risks that copied packages create on PyPI.

PyPI, Python packages, code replication, software vulnerabilities, malicious packages, package distribution, software supply chain, code cloning, security threats, malware
Authors
Sunha Park, Soojin Han, Seunghoon Woo
Abstract
We present a large-scale, in-depth study of package replication in PyPI. As a vital platform, PyPI streamlines Python package distribution for developers. However, beyond small-scale code cloning, we observe that PyPI hosts many replicated packages, which duplicate most of the codebase of an existing package. Such replication not only confuses developers but also propagates known vulnerabilities and enables the creation of new malicious packages. To address this issue, we comprehensively examine the characteristics and potential threats of replicated packages. Using one-third of the entire PyPI repository (200K packages), we investigate replication from three perspectives: replication of popular packages, of vulnerable packages, and of malicious packages. Our experiments reveal three critical findings: (1) by identifying 1,361 replicated packages of the top 3K popular projects, we show that replication frequently redistributes substantial portions of existing packages under different maintainers; (2) by uncovering 256 previously unknown replicated vulnerable packages, we demonstrate that replication creates vulnerability blind spots that current detection tools rarely catch; (3) by analyzing 3,883 known malicious packages, we find that 186 (4.79%) replicate popular packages, and this pattern further led us to identify seven previously unknown replicated malicious packages, highlighting the role of replication as an attack vector for distributing malware through minor modifications and code injection.
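The abstract does not say how replication is measured, so the following is only a minimal sketch of one plausible approach, not the authors' method. It fingerprints each file by a hash of its whitespace-normalized content and reports the fraction of an original package's files that reappear in a candidate. The function names, the toy packages, and the 0.7 threshold are all hypothetical choices made for illustration.

```python
import hashlib


def file_fingerprints(files: dict[str, str]) -> set[str]:
    """Hash each file's whitespace-normalized content; file paths are ignored."""
    fingerprints = set()
    for content in files.values():
        # Drop blank lines and surrounding whitespace so trivial reformatting
        # does not change the fingerprint (an assumption of this sketch).
        lines = [line.strip() for line in content.splitlines() if line.strip()]
        if lines:
            digest = hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
            fingerprints.add(digest)
    return fingerprints


def replication_ratio(original: dict[str, str], candidate: dict[str, str]) -> float:
    """Fraction of the original's distinct files that also occur in the candidate."""
    original_fp = file_fingerprints(original)
    if not original_fp:
        return 0.0
    return len(original_fp & file_fingerprints(candidate)) / len(original_fp)


def is_replicated(original: dict[str, str], candidate: dict[str, str],
                  threshold: float = 0.7) -> bool:
    """Flag the candidate when it reuses most of the original (hypothetical threshold)."""
    return replication_ratio(original, candidate) >= threshold


# Toy packages: the candidate renames the package directory and alters setup.py.
original_pkg = {
    "pkg/__init__.py": "from .core import run\n",
    "pkg/core.py": "def run():\n    return 42\n",
    "pkg/utils.py": "def helper(x):\n    return x + 1\n",
    "setup.py": "from setuptools import setup\nsetup(name='pkg')\n",
}
candidate_pkg = {
    "pkgx/__init__.py": "from .core import run\n",
    "pkgx/core.py": "def run():\n    return 42\n",
    "pkgx/utils.py": "def helper(x):\n    return x + 1\n",
    "setup.py": "from setuptools import setup\nimport os\nsetup(name='pkgx')\n",
}

ratio = replication_ratio(original_pkg, candidate_pkg)
flagged = is_replicated(original_pkg, candidate_pkg)
```

In this toy case three of the four original files reappear unchanged, so the ratio is 0.75 and the candidate is flagged. Measuring containment in the original (rather than a symmetric similarity) is a deliberate choice here: a replica that adds extra files, such as an injected payload, still scores high against the package it copied.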