Comparing ML-Specific and General Python Code Smells Across Project Characteristics

2026-06-01 | Software Engineering

AI summary

The authors studied how characteristics of machine learning (ML) projects relate to different kinds of code problems, called code smells, in 279 open-source Python ML projects. They found that problems specific to ML code occur much less often than general Python code issues. Commit frequency and project domain are associated with ML-specific code quality, but other factors such as project size, team size, project age, and CI/CD adoption are not, which goes against common beliefs about technical debt in software. Also, the domains most affected by ML-specific problems are not necessarily the domains most affected by general Python problems, suggesting that each kind of smell needs its own tailored quality checks. The authors conclude that ML code quality often depends on domain-specific practices and specialized tools, because standard automated checks may miss domain-specific correctness problems in ML code.

machine learning code smells, technical debt, code quality, CI/CD, commit frequency, domain-specific practices, Python programming, MLOps, Reinforcement Learning, Computer Vision
Authors
Halimeh Agh, Betül Cimendag, Stefan Wagner
Abstract
Machine learning (ML) systems consist of general-purpose code as well as ML-specific code. While ML-specific code smells have been identified, their connection to project characteristics and their interaction with overall code quality are not well understood. Without this knowledge, quality assurance strategies remain one-size-fits-all, failing to account for the contextual factors that drive technical debt in ML systems. We present empirical evidence by examining how six project features (size, age, contributors, commit frequency, CI/CD adoption, and domain) relate to both ML-specific and general Python code quality in 279 open-source ML projects on GitHub. Using CodeSmile to detect ML-specific smells and Pylint to detect general Python smells, we find that: (1) ML code smells are 41 to 94 times less frequent than general Python smells; (2) commit frequency and domain are significantly associated with ML-specific quality, while project size, team size, age, and CI/CD adoption are not, challenging traditional views on technical debt; (3) general Python smells are not linked to any project characteristic, suggesting systemic coding issues that are independent of project context; and (4) the domains that suffer most from ML-specific smells are not necessarily the same domains that suffer most from general Python smells, so each smell type calls for a tailored quality strategy. For example, MLOps projects often involve configuration issues, Reinforcement Learning projects face challenges with tensor manipulation, and Computer Vision projects encounter problems with GPU workflows. Overall, ML code quality depends on domain-specific practices and specialized CI/CD quality gates, because standard automation often overlooks domain-specific correctness problems.