Merge-Bench: Resolve Merge Conflicts with Large Language Models

2026-05-25 • Machine Learning

AI summary

The authors built Merge-Bench, a dataset of 7938 real merge conflict hunks drawn from 1439 GitHub repositories, for training and evaluating models that resolve merge conflicts automatically. They trained a model named LLMergeJ to resolve conflicts in Java programs using a reinforcement learning method, Group Relative Policy Optimization (GRPO). On Java, the 14B-parameter LLMergeJ outperformed three commercial models and trailed only Gemini 2.5 Pro, yet even the best models correctly resolved fewer than 60% of conflicts. The authors also evaluated commercial models across 11 programming languages and found their performance largely stable from language to language.

version control, merge conflict, machine learning, large language model, reinforcement learning, GitHub, Java programming, dataset, code merging, Group Relative Policy Optimization (GRPO)

Authors

Benedikt Schesch, Michael D. Ernst

Abstract

This paper applies machine learning to the difficult and important task of version control merging. (1) We constructed a dataset, Merge-Bench, of 7938 real-world merge conflict hunks from 1439 GitHub repositories. The ground truth is the merge resolution that developers committed to the repository. Our dataset construction methodology is scalable to arbitrary amounts of data since no manual labeling is required. (2) We trained a model, LLMergeJ, to resolve merge conflicts in Java programs. Our approach uses Group Relative Policy Optimization (GRPO), an online reinforcement learning method, to train a Large Language Model (LLM). (3) We performed two evaluations of the performance of LLMs on resolving merge conflicts. On Java programs, LLMergeJ with 14B parameters outperforms 3 commercial LLMs, trailing only Gemini 2.5 Pro. Across 11 programming languages, commercial LLM performance is largely stable from language to language. The best models correctly resolve less than 60% of merge conflicts.
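
To make the term "merge conflict hunk" concrete, the following is a minimal sketch of how conflict hunks can be pulled out of a file that Git has left in a conflicted state. It relies only on Git's standard conflict markers (`<<<<<<<`, `|||||||`, `=======`, `>>>>>>>`). The names `ConflictHunk` and `extract_conflict_hunks` are hypothetical and chosen for illustration; this is not the Merge-Bench construction pipeline, which the abstract does not describe in detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConflictHunk:
    """One conflicted region (hypothetical container, not from the paper)."""
    ours: List[str]    # lines between "<<<<<<<" and the next marker
    base: List[str]    # common-ancestor lines; non-empty only in diff3 style
    theirs: List[str]  # lines between "=======" and ">>>>>>>"


def extract_conflict_hunks(text: str) -> List[ConflictHunk]:
    """Collect every conflict hunk delimited by Git's standard markers.

    Assumption: a content line never starts with a seven-character marker;
    a sketch like this does not try to handle that edge case.
    """
    hunks: List[ConflictHunk] = []
    state = None  # None (outside a hunk), "ours", "base", or "theirs"
    ours: List[str] = []
    base: List[str] = []
    theirs: List[str] = []

    for line in text.splitlines():
        if state is None and line.startswith("<<<<<<<"):
            state = "ours"
            ours, base, theirs = [], [], []
        elif state == "ours" and line.startswith("|||||||"):
            state = "base"
        elif state in ("ours", "base") and line.startswith("======="):
            state = "theirs"
        elif state == "theirs" and line.startswith(">>>>>>>"):
            hunks.append(ConflictHunk(ours, base, theirs))
            state = None
        elif state == "ours":
            ours.append(line)
        elif state == "base":
            base.append(line)
        elif state == "theirs":
            theirs.append(line)
        # Lines outside any hunk are unconflicted context and are skipped.

    return hunks


# A small Java-flavoured example with a single conflict.
example = "\n".join([
    "int x = 0;",
    "<<<<<<< HEAD",
    "int y = 1;",
    "=======",
    "int y = 2;",
    ">>>>>>> feature",
    "return x + y;",
])
hunks = extract_conflict_hunks(example)
```

In a setting like the one the abstract describes, each extracted hunk would be paired with the resolution the developers actually committed, which serves as the ground truth.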
