Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross-Language Code Clone Detection
2026-05-04 • Artificial Intelligence
Artificial Intelligence · Machine Learning · Software Engineering
AI summary
The authors study how to find similar pieces of code written in different programming languages, which is hard because the code looks very different even if it does the same thing. They improve smaller, open-source models by teaching them using a larger, smarter model through a process called knowledge distillation. Their method also includes ways to make model answers more consistent and faster to produce. Tests on several language pairs show their approach helps these smaller models become more reliable and accurate. This makes smaller models more useful for detecting similar code across languages.
Cross-language code clone detection · Knowledge distillation · Large language models · Open-source models · Reasoning-oriented prompts · Binary classification · LoRA adapters · Cross-language code pairs · Response stabilization · Distribution shift
Authors
Mohamad Khajezade, Fatemeh H. Fard, Mohamed Sami Shehata
Abstract
Cross-language code clone detection (X-CCD) is challenging because semantically equivalent programs written in different languages often share little surface similarity. Although large language models (LLMs) have shown promise for semantic clone detection, their use as black-box systems raises concerns about cost, reproducibility, privacy, and unreliable output formatting. In particular, compact open-source models often struggle to follow reasoning-oriented prompts and to produce outputs that can be consistently mapped to binary clone labels. To address these limitations, we propose a knowledge distillation framework that transfers reasoning capabilities from DeepSeek-R1 into compact open-source student models for X-CCD. Using cross-language code pairs derived from Project CodeNet, we construct reasoning-oriented synthetic training data and fine-tune Phi3 and Qwen-Coder with LoRA adapters. We further introduce response stabilization methods, including forced conclusion prompting, a binary classification head, and a contrastive classification head, and evaluate model behavior using both predictive metrics and response rate. Experiments on Python–Java, Rust–Java, Rust–Python, and Rust–Ruby show that knowledge distillation consistently improves the reliability of compact models and often improves predictive performance, especially under distribution shift. In addition, classification-head variants substantially reduce inference time compared to generation-based inference. Overall, our results show that reasoning-oriented distillation combined with response stabilization makes compact open-source models more practical and reliable for X-CCD.
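To illustrate the response-stabilization idea, here is a minimal sketch of forced conclusion prompting and label parsing: the prompt instructs the model to end with a fixed conclusion line, and a parser maps the free-form response to a binary clone label or flags a non-response. The prompt wording, regex, and function names are illustrative assumptions, not the authors' exact implementation.

```python
import re

# Assumed instruction appended to the reasoning-oriented prompt so the model
# terminates with a machine-parseable verdict (forced conclusion prompting).
FORCED_CONCLUSION_SUFFIX = (
    "\n\nEnd your answer with exactly one line: 'Conclusion: yes' "
    "if the two programs are clones, otherwise 'Conclusion: no'."
)


def build_prompt(code_a: str, code_b: str) -> str:
    """Build a clone-detection prompt with a forced-conclusion instruction."""
    return (
        "Decide whether the following two programs are semantic clones.\n"
        f"Program A:\n{code_a}\n\nProgram B:\n{code_b}"
        + FORCED_CONCLUSION_SUFFIX
    )


def parse_label(response: str):
    """Map a model response to 1 (clone), 0 (not clone), or None.

    A None result means the output could not be mapped to a binary label,
    which would count against the model's response rate.
    """
    match = re.search(r"conclusion:\s*(yes|no)", response, re.IGNORECASE)
    if match is None:
        return None
    return 1 if match.group(1).lower() == "yes" else 0
```

Response rate can then be measured as the fraction of test pairs for which `parse_label` returns a non-None label, separately from accuracy on the parseable subset.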