Improving LLM-Based Go Code Review through Issue-List Generation and Context Augmentation

2026-06-01 | Software Engineering

AI summary

The authors studied how to make large language models (LLMs) better at reviewing Go code by changing how issues are reported and by adding context from related code. They introduced an issue-list review method in which the model lists all potential problems instead of only the single most important one. They also compared three kinds of context (neighboring code, LSP-based semantic information, and IR-based similar co-change examples), merged candidates generated with and without context, and applied a pruning step to keep the candidate list short. On 1,438 Go review instances, their best configuration reproduced the final human code change in 28.00% of cases, compared with 17.15% for single-issue review without context and 15.02% for CodeReviewer, and approached the 36.09% reached with real human review comments. Pruning cut the average list from 7.2 to 3.1 candidates while keeping nearly the full benefit.

Large Language Models (LLMs), code review, generation strategy, context augmentation, issue-list review, code refinement, semantic context, candidate pruning, evaluation metrics, software engineering
Authors
Kexin Sun, Yucong Guan, Jiaqi Sun, Hongyu Kuang, Guoping Rong, Dong Shao, He Zhang, Xiaoxing Ma, Christoph Treude
Abstract
LLMs have shown strong potential for automating code review, yet their practical utility depends heavily on the design of generation and context strategies. In this paper, we investigate how to improve LLM-based code review through generation strategy and contextual augmentation. We first propose an issue-list review paradigm, in which LLMs enumerate all potential issues rather than reporting only the single most important one (i.e., primary-issue review). We then systematically compare three types of code context augmentation -- neighboring, LSP-based semantics, and IR-based similar co-change context -- and study how they influence issue discovery. Finally, we integrate candidates from no-context and context-enhanced generation to improve review coverage, and introduce refinement-guided pruning to keep the candidate list at a practical size. We evaluate our approach on 1,438 Go review instances using downstream code refinement as the main metric, i.e., how often the candidate list contains at least one comment inducing the same code change as the final human revision. For comparison, we evaluate comments by CodeReviewer, a model trained specifically for review comment generation, as well as ground-truth human review comments (as a practical upper bound), under the same refinement-based evaluation. The results show that our best configuration, combining issue-list review, neighboring and similar co-change context, and candidate integration, reaches 28.00% refinement exact match, a statistically significant gain of +10.85 percentage points over primary-issue review without any additional context (17.15%), substantially outperforming CodeReviewer (15.02%) and approaching the human-oracle ceiling of 36.09%. Our refinement-guided pruning reduces the average candidate count from 7.2 to 3.1 at top-5 while retaining nearly the full benefit, making the candidate list easier to inspect.