Automated Program Repair in the Era of Large Pre-trained Language Models

2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). Pub Date: 2023-05-01. DOI: 10.1109/ICSE48619.2023.00129

Chun Xia, Yuxiang Wei, Lingming Zhang


Citations: 41

Abstract

Automated Program Repair (APR) aims to help developers automatically patch software bugs. However, current state-of-the-art traditional and learning-based APR techniques face the problem of limited patch variety, failing to fix complicated bugs. This is mainly due to the reliance on bug-fixing datasets to craft fix templates (traditional) or directly predict potential patches (learning-based). Large Pre-Trained Language Models (LLMs), trained using billions of text/code tokens, can potentially help avoid this issue. Very recently, researchers have directly leveraged LLMs for APR without relying on any bug-fixing datasets. However, such existing work either failed to include state-of-the-art LLMs or was not evaluated on realistic datasets. Thus, the true power of modern LLMs on the important APR problem is yet to be revealed. In this work, we perform the first extensive study on directly applying LLMs for APR. We select 9 recent state-of-the-art LLMs, including both generative and infilling models, ranging from 125M to 20B parameters in size. We design 3 different repair settings to evaluate the different ways we can use LLMs to generate patches: 1) generate the entire patch function, 2) fill in a chunk of code given the prefix and suffix, and 3) output a single-line fix. We apply the LLMs under these repair settings on 5 datasets across 3 different languages and compare different LLMs in the number of bugs fixed, generation speed, and compilation rate. We also compare the LLMs against recent state-of-the-art APR tools. Our study demonstrates that directly applying state-of-the-art LLMs can already substantially outperform all existing APR techniques on all our datasets. Among the studied LLMs, the scaling effect exists for APR, where larger models tend to achieve better performance. Also, we show for the first time that suffix code after the buggy line (adopted in infilling-style APR) is important not only for generating more fixes but also for generating patches with a higher compilation rate. Besides patch generation, the LLMs consider correct patches to be more natural than other ones, and can even be leveraged for effective patch ranking or patch correctness checking. Lastly, we show that LLM-based APR can be further substantially boosted via: 1) increasing the sample size, and 2) incorporating fix template information.
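The three repair settings named in the abstract can be pictured as three ways of preparing the model input. The following is only a minimal sketch under stated assumptions: the `<INFILL>` placeholder, the comment-style prompt markers, and all function names are hypothetical illustrations, not the prompts or mask tokens used in the study, and real infilling models define their own mask tokens.

```python
from typing import List

# Hypothetical placeholder; real infilling models use model-specific mask tokens.
MASK = "<INFILL>"


def complete_function_prompt(buggy_function: str) -> str:
    """Setting 1: ask a generative model to rewrite the entire function.

    The marker comments are an assumed prompt layout, not the paper's.
    """
    return f"# Buggy function\n{buggy_function}\n# Fixed function\n"


def infilling_input(lines: List[str], start: int, end: int) -> str:
    """Setting 2: replace the buggy chunk lines[start:end] with a mask,
    keeping the prefix (code before) and the suffix (code after) as context.
    """
    prefix = "".join(lines[:start])
    suffix = "".join(lines[end:])
    return f"{prefix}{MASK}\n{suffix}"


def single_line_input(lines: List[str], buggy_index: int) -> str:
    """Setting 3: the special case where exactly one line is replaced."""
    return infilling_input(lines, buggy_index, buggy_index + 1)


def apply_patch(lines: List[str], start: int, end: int, generated: str) -> str:
    """Splice model output back into the function to form a candidate patch."""
    return "".join(lines[:start]) + generated + "".join(lines[end:])


# Toy buggy function (hypothetical example): the operator precedence is wrong.
buggy = "def mid(a, b):\n    return a + b / 2\n"
lines = buggy.splitlines(keepends=True)

masked = single_line_input(lines, 1)
candidate = apply_patch(lines, 1, 2, "    return (a + b) / 2\n")
```

Each candidate produced this way would then be compiled and run against the test suite. Drawing more samples per bug simply repeats the generation step with the same input, which is one of the two boosts the abstract mentions.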


