Leveraging Language Models for Log Statement Generation in Multilingual Scenarios: How Far Are We?

2026-05-25 • Software Engineering

AI summary

The authors studied how well current automatic log statement generators work when dealing with different programming languages, not just one. They tested three top methods and five large language models on a big dataset covering five languages and found that UniLog performed best overall. They also discovered that some languages, like Python, are harder to generate logs for than others, such as JavaScript. The researchers conclude that improving log generation for multiple languages means creating tools that understand the unique ways different languages handle logging, rather than just using bigger models or more data.

Keywords: log statements, software maintenance, automated log generation, multilingual programming, UniLog, large language models, Python, JavaScript, logging idioms, benchmark dataset

Authors

Kazuki Kusama, Honglin Shu, Masanari Kondo, Yasutaka Kamei

Abstract

Log statements capture critical information for software maintenance activities such as testing, debugging, and failure analysis. Because of this importance, developers must carefully design log statements, which requires significant effort. To support developers, various end-to-end automated log statement generation approaches have been proposed. However, these approaches have mainly been evaluated within a single programming language environment, and their effectiveness in multilingual environments remains underexplored. In this paper, we therefore comparatively evaluate three state-of-the-art log statement generation approaches and five large language models (LLMs) across multiple programming languages. For this purpose, we constructed a multilingual benchmark comprising 150,000 instances across five programming languages. Our empirical results demonstrate that UniLog, a state-of-the-art approach, achieves the best overall performance, maintaining high effectiveness even in multilingual environments. We also observe substantial variance in the difficulty of log generation across languages: Python presents a greater challenge, whereas JavaScript yields comparatively better performance. Detailed analysis reveals that these disparities stem from variations in log insertion distributions and language-specific logging idioms. Our findings indicate that simply scaling model size or the volume of training data is insufficient for multilingual log generation; rather, designing approaches tailored to the specific characteristics of target languages is crucial. These findings suggest that future automated logging techniques should explicitly account for language-specific logging characteristics to achieve robust performance in multilingual software development environments.
