Yuning Li, Wenkang Zhong, Zongwen Shen, Chuanyi Li, Xiang Chen, Jidong Ge, Bin Luo
{"title":"自动补丁正确性评估中llm代码自然度建模能力的实证研究","authors":"Yuning Li, Wenkang Zhong, Zongwen Shen, Chuanyi Li, Xiang Chen, Jidong Ge, Bin Luo","doi":"10.1007/s10515-025-00502-y","DOIUrl":null,"url":null,"abstract":"<div><p>Just like natural language, code can exhibit naturalness. This property manifests in highly repetitive patterns within specific contexts. Code naturalness can be captured by language models and then applied to various software engineering tasks (such as fault localization and program repair). Recently, Large Language Models (LLMs) based on Transformers have become advantageous tools for modeling code naturalness. However, existing work lacks systematic studies on the code naturalness modeling capability for LLMs. To bridge this gap, this paper explores the code naturalness modeling capability for LLMs, starting with the task of automated patch correctness assessment. Specifically, we investigate whether LLMs with different architectures and scales, under varying context window sizes, (1) can identify buggy code from common code based on naturalness and consider fixed code more natural than buggy code, and (2) can distinguish different degrees of repairs (i.e., complete repairs and incomplete repairs) from automated tools. Then, we propose metrics to assess the above two capabilities of the models. Experimental results indicate that models with different architectures and scales have the code naturalness modeling capability, even models not specifically pre-trained on code. Additionally, smaller models do not necessarily exhibit weaker modeling capability compared to larger models. We also find more contextual information only provides limited benefits. Based on experimental findings, we select the best performing model that has 220 M parameters to develop an Entropy-based Automated Patch Correctness Assessment (E-APCA) approach by calculating code naturalness. On the large-scale dataset PraPatch, E-APCA surpasses traditional methods by over 20% across various evaluation metrics. Compared to the latest APCA method Entropy-delta based on a 6.7B LLM, E-APCA achieves a 17.32% higher correct patch recall and a 6.83% higher F1 score, while the reasoning time is less than 7% of that required by Entropy-delta.</p></div>","PeriodicalId":55414,"journal":{"name":"Automated Software Engineering","volume":"32 2","pages":""},"PeriodicalIF":2.0000,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An empirical study on the code naturalness modeling capability for LLMs in automated patch correctness assessment\",\"authors\":\"Yuning Li, Wenkang Zhong, Zongwen Shen, Chuanyi Li, Xiang Chen, Jidong Ge, Bin Luo\",\"doi\":\"10.1007/s10515-025-00502-y\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Just like natural language, code can exhibit naturalness. This property manifests in highly repetitive patterns within specific contexts. Code naturalness can be captured by language models and then applied to various software engineering tasks (such as fault localization and program repair). Recently, Large Language Models (LLMs) based on Transformers have become advantageous tools for modeling code naturalness. However, existing work lacks systematic studies on the code naturalness modeling capability for LLMs. To bridge this gap, this paper explores the code naturalness modeling capability for LLMs, starting with the task of automated patch correctness assessment. 
Specifically, we investigate whether LLMs with different architectures and scales, under varying context window sizes, (1) can identify buggy code from common code based on naturalness and consider fixed code more natural than buggy code, and (2) can distinguish different degrees of repairs (i.e., complete repairs and incomplete repairs) from automated tools. Then, we propose metrics to assess the above two capabilities of the models. Experimental results indicate that models with different architectures and scales have the code naturalness modeling capability, even models not specifically pre-trained on code. Additionally, smaller models do not necessarily exhibit weaker modeling capability compared to larger models. We also find more contextual information only provides limited benefits. Based on experimental findings, we select the best performing model that has 220 M parameters to develop an Entropy-based Automated Patch Correctness Assessment (E-APCA) approach by calculating code naturalness. On the large-scale dataset PraPatch, E-APCA surpasses traditional methods by over 20% across various evaluation metrics. Compared to the latest APCA method Entropy-delta based on a 6.7B LLM, E-APCA achieves a 17.32% higher correct patch recall and a 6.83% higher F1 score, while the reasoning time is less than 7% of that required by Entropy-delta.</p></div>\",\"PeriodicalId\":55414,\"journal\":{\"name\":\"Automated Software Engineering\",\"volume\":\"32 2\",\"pages\":\"\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2025-04-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Automated Software Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://link.springer.com/article/10.1007/s10515-025-00502-y\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Automated Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10515-025-00502-y","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0
An empirical study on the code naturalness modeling capability for LLMs in automated patch correctness assessment

Abstract
Just like natural language, code exhibits naturalness: within a given context it tends to follow highly repetitive patterns. This naturalness can be captured by language models and then exploited in various software engineering tasks, such as fault localization and program repair. Recently, Transformer-based Large Language Models (LLMs) have become powerful tools for modeling code naturalness. However, existing work lacks a systematic study of the code naturalness modeling capability of LLMs. To bridge this gap, this paper explores the code naturalness modeling capability of LLMs, starting from the task of automated patch correctness assessment. Specifically, we investigate whether LLMs with different architectures and scales, under varying context window sizes, (1) can distinguish buggy code from common code based on naturalness and consider fixed code more natural than buggy code, and (2) can distinguish different degrees of repair (i.e., complete and incomplete repairs) produced by automated tools. We then propose metrics to assess these two capabilities. Experimental results indicate that models of different architectures and scales possess code naturalness modeling capability, even models not specifically pre-trained on code. Moreover, smaller models do not necessarily exhibit weaker modeling capability than larger ones, and additional contextual information provides only limited benefits. Based on these findings, we select the best-performing model, which has 220M parameters, to develop an Entropy-based Automated Patch Correctness Assessment (E-APCA) approach that judges patches by calculating code naturalness. On the large-scale dataset PraPatch, E-APCA surpasses traditional methods by over 20% across various evaluation metrics. Compared with Entropy-delta, the latest APCA method based on a 6.7B LLM, E-APCA achieves a 17.32% higher correct-patch recall and a 6.83% higher F1 score, while its inference time is less than 7% of that required by Entropy-delta.
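The assessment described in the abstract rests on the standard cross-entropy notion of code naturalness: a language model assigns each token a probability given its preceding context, and the mean negative log-probability of a snippet measures how surprising, i.e., how unnatural, it is. The sketch below illustrates this idea with an off-the-shelf causal code model from Hugging Face Transformers; the model name, the example snippets, and the margin-based decision rule are illustrative assumptions, not the exact configuration of E-APCA.

```python
# Minimal sketch of entropy-based code naturalness, assuming an
# off-the-shelf causal code LM from Hugging Face Transformers.
# The model name, snippets, and decision rule below are illustrative
# assumptions, not the setup used in the paper.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/codegen-350M-mono"  # hypothetical model choice

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def naturalness_entropy(code: str) -> float:
    """Mean per-token cross-entropy in bits; lower means more natural code."""
    inputs = tokenizer(code, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the average
        # negative log-likelihood (in nats) over the predicted tokens.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return loss.item() / math.log(2)  # convert nats to bits


def looks_like_correct_patch(buggy: str, patched: str, margin: float = 0.0) -> bool:
    """Toy decision rule: accept the patch if it is at least as natural
    as the buggy code it replaces (up to a tunable margin)."""
    return naturalness_entropy(patched) <= naturalness_entropy(buggy) + margin


buggy_line = "if (i <= items.size()) { total += items.get(i); }"
patched_line = "if (i < items.size()) { total += items.get(i); }"

print(naturalness_entropy(buggy_line), naturalness_entropy(patched_line))
print(looks_like_correct_patch(buggy_line, patched_line))
```

In the paper's setting, such per-token entropies are computed under varying context window sizes and compared across buggy, incompletely repaired, and completely repaired code; the toy margin-based rule above only conveys the intuition behind that comparison.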
Journal introduction:
This journal publishes research papers, tutorials, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes.
Coverage in Automated Software Engineering examines both automatic systems and collaborative systems as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences and workshops.