{"title":"用于改进基于检索和编辑的断言生成的检索增强微调","authors":"Hongyan Li;Weifeng Sun;Meng Yan;Ling Xu;Qiang Li;Xiaohong Zhang;Hongyu Zhang","doi":"10.1109/TSE.2025.3558403","DOIUrl":null,"url":null,"abstract":"Unit Testing is crucial in software development and maintenance, aiming to verify that the implemented functionality is consistent with the expected functionality. A unit test is composed of two parts: a test prefix, which drives the unit under test to a particular state, and a test assertion, which determines what the expected behavior is under that state. To reduce the effort of conducting unit tests manually, Yu et al. proposed an integrated approach (<i>integration</i> for short), combining information retrieval with a deep learning-based approach to generate assertions for test prefixes, and obtained promising results. In our previous work, we found that the overall performance of <i>integration</i> is mainly due to its success in retrieving assertions. Moreover, <i>integration</i> is limited to specific types of edit operations and struggles to understand the semantic differences between the retrieved focal-test (<i>focal-test</i> includes a test prefix and a unit under test) and the input focal-test. Based on these insights, we then proposed a retrieve-and-edit approach named <small>EditAS</small> to learn the assertion edit patterns to improve the effectiveness of assertion generation in our prior study. Despite being promising, we find that the effectiveness of <small>EditAS</small> can be further improved. Our analysis shows that: ① The editing ability of <small>EditAS</small> still has ample room for improvement. Its performance degrades as the edit distance between the retrieval assertion and ground truth increases. Specifically, the average accuracy of <small>EditAS</small> is <inline-formula><tex-math>$12.38\\%$</tex-math></inline-formula> when the edit distance is greater than 5. 
② <small>EditAS</small> lacks a fine-grained semantic understanding of both the retrieved focal-test and the input focal-test themselves, which leads to many inaccurate token modifications. In particular, an average of 25.57% of the incorrectly generated assertions that need to be modified are not modified, and an average of 6.45% of the assertions that match the ground truth are still modified. Thanks to pre-trained models employing pre-training paradigms on large-scale data, they tend to have good semantic comprehension and code generation abilities. In light of this, we propose <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula>, which improves retrieval-and-edit based assertion generation through retrieval-augmented fine-tuning. Specifically, <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula> first retrieves a similar focal-test from a predefined corpus and treats its assertion as a prototype. Then, <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula> uses a pre-trained model, CodeT5, to learn the semantics of the input and similar focal-tests as well as assertion editing patterns to automatically edit the prototype. We first evaluate the <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula> for its inference performance on two large-scale datasets, and the experimental results show that <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula> outperforms state-of-the-art assertion generation methods and pre-trained models, with average performance improvements of 15.93%-129.19% and 11.01%-68.88% in accuracy and CodeBLEU, respectively. We also evaluate the performance of <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula> in detecting real-world bugs from Defects4J. 
The experimental results indicate that <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula> achieves the best bug detection performance among all the methods.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 5","pages":"1591-1614"},"PeriodicalIF":5.6000,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Retrieval-Augmented Fine-Tuning for Improving Retrieve-and-Edit Based Assertion Generation\",\"authors\":\"Hongyan Li;Weifeng Sun;Meng Yan;Ling Xu;Qiang Li;Xiaohong Zhang;Hongyu Zhang\",\"doi\":\"10.1109/TSE.2025.3558403\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Unit Testing is crucial in software development and maintenance, aiming to verify that the implemented functionality is consistent with the expected functionality. A unit test is composed of two parts: a test prefix, which drives the unit under test to a particular state, and a test assertion, which determines what the expected behavior is under that state. To reduce the effort of conducting unit tests manually, Yu et al. proposed an integrated approach (<i>integration</i> for short), combining information retrieval with a deep learning-based approach to generate assertions for test prefixes, and obtained promising results. In our previous work, we found that the overall performance of <i>integration</i> is mainly due to its success in retrieving assertions. Moreover, <i>integration</i> is limited to specific types of edit operations and struggles to understand the semantic differences between the retrieved focal-test (<i>focal-test</i> includes a test prefix and a unit under test) and the input focal-test. Based on these insights, we then proposed a retrieve-and-edit approach named <small>EditAS</small> to learn the assertion edit patterns to improve the effectiveness of assertion generation in our prior study. 
Despite being promising, we find that the effectiveness of <small>EditAS</small> can be further improved. Our analysis shows that: ① The editing ability of <small>EditAS</small> still has ample room for improvement. Its performance degrades as the edit distance between the retrieval assertion and ground truth increases. Specifically, the average accuracy of <small>EditAS</small> is <inline-formula><tex-math>$12.38\\\\%$</tex-math></inline-formula> when the edit distance is greater than 5. ② <small>EditAS</small> lacks a fine-grained semantic understanding of both the retrieved focal-test and the input focal-test themselves, which leads to many inaccurate token modifications. In particular, an average of 25.57% of the incorrectly generated assertions that need to be modified are not modified, and an average of 6.45% of the assertions that match the ground truth are still modified. Thanks to pre-trained models employing pre-training paradigms on large-scale data, they tend to have good semantic comprehension and code generation abilities. In light of this, we propose <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula>, which improves retrieval-and-edit based assertion generation through retrieval-augmented fine-tuning. Specifically, <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula> first retrieves a similar focal-test from a predefined corpus and treats its assertion as a prototype. Then, <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula> uses a pre-trained model, CodeT5, to learn the semantics of the input and similar focal-tests as well as assertion editing patterns to automatically edit the prototype. 
We first evaluate the <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula> for its inference performance on two large-scale datasets, and the experimental results show that <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula> outperforms state-of-the-art assertion generation methods and pre-trained models, with average performance improvements of 15.93%-129.19% and 11.01%-68.88% in accuracy and CodeBLEU, respectively. We also evaluate the performance of <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula> in detecting real-world bugs from Defects4J. The experimental results indicate that <inline-formula><tex-math>$EditAS^{2}$</tex-math></inline-formula> achieves the best bug detection performance among all the methods.\",\"PeriodicalId\":13324,\"journal\":{\"name\":\"IEEE Transactions on Software Engineering\",\"volume\":\"51 5\",\"pages\":\"1591-1614\"},\"PeriodicalIF\":5.6000,\"publicationDate\":\"2025-04-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Software Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10949862/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10949862/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE 
ENGINEERING","Score":null,"Total":0}
Retrieval-Augmented Fine-Tuning for Improving Retrieve-and-Edit Based Assertion Generation
Unit testing is crucial in software development and maintenance, aiming to verify that the implemented functionality is consistent with the expected functionality. A unit test is composed of two parts: a test prefix, which drives the unit under test to a particular state, and a test assertion, which determines the expected behavior under that state. To reduce the effort of conducting unit tests manually, Yu et al. proposed an integrated approach (integration for short), combining information retrieval with a deep learning-based approach to generate assertions for test prefixes, and obtained promising results. In our previous work, we found that the overall performance of integration is mainly attributable to its success in retrieving assertions. Moreover, integration is limited to specific types of edit operations and struggles to understand the semantic differences between the retrieved focal-test (a focal-test includes a test prefix and a unit under test) and the input focal-test. Based on these insights, in our prior study we proposed a retrieve-and-edit approach named EditAS, which learns assertion edit patterns to improve the effectiveness of assertion generation. Although promising, the effectiveness of EditAS can be further improved. Our analysis shows that: ① The editing ability of EditAS still has ample room for improvement. Its performance degrades as the edit distance between the retrieved assertion and the ground truth increases; specifically, the average accuracy of EditAS is only 12.38% when the edit distance is greater than 5. ② EditAS lacks a fine-grained semantic understanding of both the retrieved focal-test and the input focal-test, which leads to many inaccurate token modifications. In particular, an average of 25.57% of the incorrectly generated assertions that need to be modified are left unmodified, while an average of 6.45% of the assertions that already match the ground truth are modified anyway.
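The edit-distance analysis above can be made concrete. The abstract does not specify the exact distance used; a token-level Levenshtein distance is a common choice for comparing assertions, so the sketch below is an assumption, not the paper's implementation. The example assertions (a JUnit-style retrieved prototype vs. a ground truth) are likewise hypothetical, sketched in Python for brevity even though the paper's datasets are Java:

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance between two token sequences.

    Counts the minimum number of single-token insertions, deletions,
    and substitutions needed to turn sequence `a` into sequence `b`.
    """
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances for the previous row of the DP table
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete a[i-1]
                          curr[j - 1] + 1,     # insert b[j-1]
                          prev[j - 1] + cost)  # match or substitute
        prev = curr
    return prev[n]

# Hypothetical retrieved assertion vs. ground truth, tokenized on whitespace.
retrieved = "assertEquals ( 1 , result )".split()
truth = "assertEquals ( 2 , list . size ( ) )".split()
print(edit_distance(retrieved, truth))
```

A distance greater than 5, as in this pair, is exactly the regime in which the abstract reports EditAS's average accuracy dropping to 12.38%.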
Pre-trained models, by virtue of pre-training on large-scale data, tend to have strong semantic comprehension and code generation abilities. In light of this, we propose EditAS², which improves retrieve-and-edit based assertion generation through retrieval-augmented fine-tuning. Specifically, EditAS² first retrieves a similar focal-test from a predefined corpus and treats its assertion as a prototype. Then, EditAS² uses a pre-trained model, CodeT5, to learn the semantics of the input and similar focal-tests, as well as assertion editing patterns, to automatically edit the prototype. We first evaluate the inference performance of EditAS² on two large-scale datasets; the experimental results show that EditAS² outperforms state-of-the-art assertion generation methods and pre-trained models, with average performance improvements of 15.93%-129.19% in accuracy and 11.01%-68.88% in CodeBLEU, respectively. We also evaluate the performance of EditAS² in detecting real-world bugs from Defects4J. The experimental results indicate that EditAS² achieves the best bug detection performance among all the methods.
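The retrieve-and-edit pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the corpus entries, the focal-test strings, and the token-set Jaccard retriever are all hypothetical stand-ins (the actual EditAS² retrieval and the CodeT5-based editing step are substantially more involved). The sketch only covers the first stage, retrieving the prototype assertion that the model would then edit:

```python
def jaccard(a_tokens, b_tokens):
    """Token-set Jaccard similarity, used here as a simple retrieval score."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical corpus of (focal-test, assertion) pairs. A focal-test pairs a
# test prefix with the unit under test, as described in the abstract.
corpus = [
    ("public void testAdd() { int r = calc.add(1, 1); } int add(int a, int b)",
     "assertEquals(2, r);"),
    ("public void testPop() { Object o = stack.pop(); } Object pop()",
     "assertNotNull(o);"),
]

def retrieve_prototype(input_focal_test):
    """Return the assertion of the most similar corpus focal-test as the
    prototype to be edited by the downstream model."""
    query = input_focal_test.split()
    best = max(corpus, key=lambda pair: jaccard(query, pair[0].split()))
    return best[1]

proto = retrieve_prototype(
    "public void testAdd2() { int r = calc.add(2, 3); } int add(int a, int b)")
print(proto)  # the retrieved prototype assertion, before editing
```

In the full approach, the prototype and both focal-tests would then be fed to the fine-tuned CodeT5 model, which edits the prototype (here, presumably changing the expected value) rather than generating an assertion from scratch.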
Journal introduction:
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.