Ensuring Open Source Integrity: The Intersection of Copy-Based Reuse and License Compliance
2026-06-22 • Software Engineering
AI summary
The authors studied how source code is reused across open source projects and how this reuse might violate software licenses. They built a large network showing where code was copied directly between projects and found that 39.4% of project combinations were at risk of potential license noncompliance, especially when licenses are missing or unclear. They also found that code under permissive licenses like MIT is more likely to be reused across different programming languages, while copyleft licenses like GPL show mixed effects. Dependency analysis uncovered only 2.43% of the reuse found in the copy-based network, making it hard to track compliance accurately with existing dependency-tracking tools.
Source code, Open source software, Software licenses, License compliance, Code reuse, Permissive licenses, Copyleft licenses, Dependency analysis, Software ecosystems, Software copyright
Authors
Mahmoud Jahanshahi, Bogdan Vasilescu, Audris Mockus
Abstract
Like other creative works, source code is protected by copyright. The owner can license the work, e.g., to permit copying and other kinds of use, and can even start legal proceedings against license violators. However, source code can be reused in subtle ways, e.g., via copying without explicit package manager dependencies, making it hard to reason about potential license noncompliance. Using the World of Code infrastructure, which approximates the entirety of open source software, in this paper we create a copy-based code reuse network that maps direct copying across projects, and use it to quantify the extent of potential license noncompliance across the entire open source ecosystem. In addition, we estimate regression models to understand whether code copying is affected by the origin project's license and, if so, how it varies with other project characteristics. We find that code in repositories with permissive licenses, such as MIT and Apache, shows a higher likelihood of reuse across programming languages. In contrast, copyleft licenses, like the GPL, exhibit mixed effects. Public domain licenses, despite their aim of allowing unrestricted use, are associated with a lower likelihood of copy-based reuse. Widespread potential license noncompliance appears to accompany copy-based reuse, with 39.4% of project combinations at potential noncompliance risk, particularly when licenses are unclear or absent. Our findings reveal that only 2.43% of the reuse detected through the copy-based network was discoverable via dependency analysis, highlighting the limitations of existing dependency-tracking tools in capturing copy-based reuse.
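To make the idea of a copy-based reuse network concrete, here is a minimal Python sketch. It is not the authors' pipeline: it assumes that reuse is inferred from identical file content (a blob hash) appearing in more than one project, and that the project with the earliest commit of that content is treated as the origin. The observations, the license labels, and the helper names `build_reuse_edges` and `flag_unclear` are all hypothetical, and the flagging rule (either side lacks a detected license) is a toy stand-in for one of the risk conditions the abstract mentions.

```python
from collections import defaultdict

# Hypothetical observations: (blob_hash, project, time_of_first_commit_in_project)
observations = [
    ("b1", "projA", 100),
    ("b1", "projB", 250),
    ("b1", "projC", 300),
    ("b2", "projB", 120),
    ("b2", "projA", 400),
    ("b3", "projC", 50),
]

# Hypothetical license labels; None means no license was detected.
licenses = {"projA": "MIT", "projB": "GPL-3.0", "projC": None}


def build_reuse_edges(observations):
    """Return {(origin, destination): number_of_shared_blobs}.

    Assumption: the project that committed a blob first is its origin,
    and every later project holding the same blob copied it.
    """
    by_blob = defaultdict(list)
    for blob, project, first_seen in observations:
        by_blob[blob].append((first_seen, project))

    edges = defaultdict(int)
    for sightings in by_blob.values():
        sightings.sort()  # earliest sighting first
        origin = sightings[0][1]
        for _, destination in sightings[1:]:
            if destination != origin:
                edges[(origin, destination)] += 1
    return dict(edges)


def flag_unclear(edges, licenses):
    """Toy rule: flag project pairs where either side has no detected license."""
    return sorted(
        pair
        for pair in edges
        if licenses.get(pair[0]) is None or licenses.get(pair[1]) is None
    )


edges = build_reuse_edges(observations)
flagged = flag_unclear(edges, licenses)
print(edges)
print(flagged)
```

None of these edges would be visible to a tool that only reads package manager manifests, which is the gap the abstract highlights. A real analysis would also need a license compatibility model rather than the single missing-license check used here.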