Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages

2026-06-18, Artificial Intelligence

Artificial Intelligence, Programming Languages
AI summary

The authors created Multi-LCB, a new test that checks how well large language models (LLMs) can write code in twelve different programming languages, not just Python like the original LiveCodeBench (LCB). Multi-LCB converts Python coding problems from LCB into other languages while keeping the same evaluation rules to fairly compare model performance. When testing 24 LLMs, the authors found many models did better in Python, showing they might be too focused on that language and struggle with others. This work points out important challenges in making code-writing AI work well across multiple programming languages.

Large Language Models, Code Generation, Benchmark, LiveCodeBench, Multi-LCB, Programming Languages, Cross-language Evaluation, Contamination-aware Evaluation, Instruction Tuning, Multilingual Performance
Authors
Maria Ivanova, Pavel Zadorozhny, Rodion Levichev, Ivan Petrov, Adamenko Pavel, Ivan Lopatin, Alexey Kutalev, Dmitrii Babaev
Abstract
LiveCodeBench (LCB) has recently become a widely adopted benchmark for evaluating large language models (LLMs) on code-generation tasks. By curating competitive programming problems, continually adding fresh problems to the set, and filtering them by release date, LCB provides contamination-aware evaluation and offers a holistic view of coding capability. However, LCB remains restricted to Python, leaving open the question of whether LLMs can generalize across the diverse programming languages required in real-world software engineering. We introduce Multi-LCB, a benchmark for evaluating LLMs across twelve programming languages, including Python. Multi-LCB transforms Python tasks from the LCB dataset into equivalent tasks in other languages while preserving LCB's contamination controls and evaluation protocol. Because it is fully compatible with the original LCB format, Multi-LCB will automatically track future LCB updates, enabling systematic assessment of cross-language code generation competence and requiring models to sustain performance well beyond Python. We evaluated 24 instruction-tuned and reasoning LLMs on Multi-LCB, uncovering evidence of Python overfitting, language-specific contamination, and substantial disparities in multilingual performance. Our results establish Multi-LCB as a rigorous new benchmark for multi-programming-language code evaluation, directly addressing LCB's primary limitation and exposing critical gaps in current LLM capabilities.
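
The release-date filtering and per-language comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `Result` record, the `pass_rate_by_language` helper, the cutoff date, and the sample data are all hypothetical, and "rust" merely stands in for a non-Python language, since the abstract does not list the twelve languages.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    """One model attempt at one problem in one language (hypothetical record)."""
    problem_id: str
    language: str
    release_date: date
    passed: bool


def pass_rate_by_language(results, cutoff):
    """Return the pass rate per language, counting only problems released
    strictly after `cutoff` (assumed here to be the model's training cutoff)."""
    totals = {}  # language -> (solved, seen)
    for r in results:
        if r.release_date <= cutoff:
            continue  # may have been seen during training; exclude it
        solved, seen = totals.get(r.language, (0, 0))
        totals[r.language] = (solved + int(r.passed), seen + 1)
    return {lang: solved / seen for lang, (solved, seen) in totals.items()}


# Made-up sample data, for illustration only.
results = [
    Result("p1", "python", date(2024, 1, 10), True),   # released before cutoff
    Result("p2", "python", date(2024, 7, 1), True),
    Result("p3", "python", date(2024, 8, 1), False),
    Result("p2", "rust", date(2024, 7, 1), False),
    Result("p3", "rust", date(2024, 8, 1), False),
]
rates = pass_rate_by_language(results, cutoff=date(2024, 6, 1))
```

On this made-up data, problem p1 is dropped because it predates the assumed cutoff, and the remaining problems give a higher pass rate for Python than for the other language, which is the kind of gap the abstract refers to as a disparity in multilingual performance.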