Retrieval-Augmented Fine-Tuning for Improving Retrieve-and-Edit Based Assertion Generation

IF 5.6 | Zone 1 (Computer Science) | JCR Q1, COMPUTER SCIENCE, SOFTWARE ENGINEERING

IEEE Transactions on Software Engineering | Pub Date: 2025-04-07 | DOI: 10.1109/TSE.2025.3558403

Hongyan Li; Weifeng Sun; Meng Yan; Ling Xu; Qiang Li; Xiaohong Zhang; Hongyu Zhang

Volume 51, Issue 5, pp. 1591-1614. Full text: https://ieeexplore.ieee.org/document/10949862/

Citations: 0

Abstract

Unit Testing is crucial in software development and maintenance, aiming to verify that the implemented functionality is consistent with the expected functionality. A unit test is composed of two parts: a test prefix, which drives the unit under test to a particular state, and a test assertion, which determines what the expected behavior is under that state. To reduce the effort of conducting unit tests manually, Yu et al. proposed an integrated approach (integration for short), combining information retrieval with a deep learning-based approach to generate assertions for test prefixes, and obtained promising results. In our previous work, we found that the overall performance of integration is mainly due to its success in retrieving assertions. Moreover, integration is limited to specific types of edit operations and struggles to understand the semantic differences between the retrieved focal-test (focal-test includes a test prefix and a unit under test) and the input focal-test. Based on these insights, we then proposed a retrieve-and-edit approach named EditAS to learn the assertion edit patterns to improve the effectiveness of assertion generation in our prior study. Despite being promising, we find that the effectiveness of EditAS can be further improved. Our analysis shows that: ① The editing ability of EditAS still has ample room for improvement. Its performance degrades as the edit distance between the retrieval assertion and ground truth increases. Specifically, the average accuracy of EditAS is $12.38\%$ when the edit distance is greater than 5. ② EditAS lacks a fine-grained semantic understanding of both the retrieved focal-test and the input focal-test themselves, which leads to many inaccurate token modifications. In particular, an average of 25.57% of the incorrectly generated assertions that need to be modified are not modified, and an average of 6.45% of the assertions that match the ground truth are still modified. Thanks to pre-trained models employing pre-training paradigms on large-scale data, they tend to have good semantic comprehension and code generation abilities. In light of this, we propose $EditAS^{2}$, which improves retrieval-and-edit based assertion generation through retrieval-augmented fine-tuning. Specifically, $EditAS^{2}$ first retrieves a similar focal-test from a predefined corpus and treats its assertion as a prototype. Then, $EditAS^{2}$ uses a pre-trained model, CodeT5, to learn the semantics of the input and similar focal-tests as well as assertion editing patterns to automatically edit the prototype. We first evaluate the $EditAS^{2}$ for its inference performance on two large-scale datasets, and the experimental results show that $EditAS^{2}$ outperforms state-of-the-art assertion generation methods and pre-trained models, with average performance improvements of 15.93%-129.19% and 11.01%-68.88% in accuracy and CodeBLEU, respectively. We also evaluate the performance of $EditAS^{2}$ in detecting real-world bugs from Defects4J. The experimental results indicate that $EditAS^{2}$ achieves the best bug detection performance among all the methods.
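To make the retrieve-then-edit idea concrete, the following is a minimal sketch of the retrieval step and of the token-level edit distance between a retrieved assertion and the ground truth. It is an illustration only, not the authors' implementation: the abstract does not state which similarity measure is used, so Jaccard similarity over tokens is an assumption here, and the tokenizer, the function names (`tokenize`, `jaccard`, `retrieve`, `edit_distance`), and the toy corpus are all hypothetical.

```python
import re
from typing import List, Tuple


def tokenize(code: str) -> List[str]:
    """Split code into identifiers, numbers, and single punctuation tokens."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def jaccard(a: List[str], b: List[str]) -> float:
    """Jaccard similarity of two token lists (an assumed retrieval metric)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def retrieve(query_focal_test: str,
             corpus: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return the (focal_test, assertion) pair most similar to the query."""
    q = tokenize(query_focal_test)
    return max(corpus, key=lambda pair: jaccard(q, tokenize(pair[0])))


def edit_distance(a: List[str], b: List[str]) -> int:
    """Token-level Levenshtein distance using a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, start=1):
        curr = [i]
        for j, tb in enumerate(b, start=1):
            cost = 0 if ta == tb else 1
            curr.append(min(prev[j] + 1,          # delete ta
                            curr[j - 1] + 1,      # insert tb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical toy corpus of (focal-test, assertion) pairs.
corpus = [
    ("testSize() { List l = new List(); l.add(1); } size()",
     "assertEquals(1, l.size())"),
    ("testEmpty() { Stack s = new Stack(); } isEmpty()",
     "assertTrue(s.isEmpty())"),
]
query = "testSize() { List xs = new List(); xs.add(1); xs.add(2); } size()"

similar_focal_test, prototype = retrieve(query, corpus)
ground_truth = "assertEquals(2, xs.size())"
distance = edit_distance(tokenize(prototype), tokenize(ground_truth))
```

In this toy case the retrieved prototype differs from the ground truth in two tokens, which is the kind of small edit a retrieve-and-edit model is expected to handle; the abstract reports that accuracy drops as this distance grows. In the approach described above, the input focal-test, the similar focal-test, and the prototype would then be passed to a fine-tuned pre-trained model for editing, a step this sketch does not attempt to reproduce.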


Source journal

IEEE Transactions on Software Engineering (Engineering and Technology - Engineering: Electrical and Electronic)

CiteScore

9.70

Self-citation rate

10.80%

Articles published

724

Review time

6 months

Journal description: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.