{"title":"Improving Retrieval-Augmented Deep Assertion Generation via Joint Training","authors":"Quanjun Zhang;Chunrong Fang;Yi Zheng;Ruixiang Qian;Shengcheng Yu;Yuan Zhao;Jianyi Zhou;Yun Yang;Tao Zheng;Zhenyu Chen","doi":"10.1109/TSE.2025.3545970","DOIUrl":null,"url":null,"abstract":"Unit testing attempts to validate the correctness of basic units of the software system under test and has a crucial role in software development and testing. However, testing experts have to spend a huge amount of effort to write unit test cases manually. Very recent work proposes a retrieve-and-edit approach to automatically generate unit test oracles, <italic>i.e.,</i> assertions. Despite being promising, it is still far from perfect due to some limitations, such as splitting assertion retrieval and generation into two separate components without benefiting each other. In this paper, we propose AG-RAG, a retrieval-augmented automated assertion generation (AG) approach that leverages external codebases and joint training to address various technical limitations of prior work. Inspired by the plastic surgery hypothesis, AG-RAG attempts to combine relevant unit tests and advanced pre-trained language models (PLMs) with retrieval-augmented fine-tuning. The key insight of AG-RAG is to simultaneously optimize the retriever and the generator as a whole pipeline with a joint training strategy, enabling them to learn from each other. Particularly, AG-RAG builds a dense retriever to search for relevant test-assert pairs (TAPs) with semantic matching and a retrieval-augmented generator to synthesize accurate assertions with the focal-test and retrieved TAPs as input. Besides, AG-RAG leverages a code-aware language model CodeT5 as the cornerstone to facilitate both assertion retrieval and generation tasks. Furthermore, AG-RAG designs a joint training strategy that allows the retriever to learn from the feedback provided by the generator. This unified design fully adapts both components specifically for retrieving more useful TAPs, thereby generating accurate assertions. AG-RAG is a generic framework that can be adapted to various off-the-shelf PLMs. We extensively evaluate AG-RAG against six state-of-the-art AG approaches on two benchmarks and three metrics. Experimental results show that AG-RAG significantly outperforms previous AG approaches on all benchmarks and metrics, <italic>e.g.,</i> improving the most recent baseline <sc>EditAS</small> by 20.82% and 26.98% in terms of accuracy. AG-RAG also correctly generates 1739 and 2866 unique assertions that all baselines fail to generate, 3.45X and 9.20X more than <sc>EditAS</small>. We further demonstrate the positive contribution of our joint training strategy, <italic>e.g.,</i> AG-RAG improving a variant without the retriever by an average accuracy of 14.11%. Besides, adopting other PLMs can provide substantial advancement, <italic>e.g.,</i> AG-RAG with four different PLMs improving EditAS by an average accuracy of 9.02%, highlighting the generalizability of our framework. 
Overall, our work demonstrates the promising potential of jointly fine-tuning the PLM-based retriever and generator to predict accurate assertions by incorporating external knowledge sources, thereby reducing the manual efforts of unit testing experts in practical scenarios.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 4","pages":"1232-1247"},"PeriodicalIF":6.5000,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10904092/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0
Abstract
Unit testing attempts to validate the correctness of the basic units of the software system under test and plays a crucial role in software development and testing. However, testing experts have to spend considerable effort writing unit test cases manually. Recent work proposes a retrieve-and-edit approach to automatically generate unit test oracles, i.e., assertions. Despite being promising, it is still far from perfect due to limitations such as splitting assertion retrieval and generation into two separate components that cannot benefit from each other. In this paper, we propose AG-RAG, a retrieval-augmented automated assertion generation (AG) approach that leverages external codebases and joint training to address the technical limitations of prior work. Inspired by the plastic surgery hypothesis, AG-RAG combines relevant unit tests and advanced pre-trained language models (PLMs) with retrieval-augmented fine-tuning. The key insight of AG-RAG is to optimize the retriever and the generator simultaneously as a whole pipeline with a joint training strategy, enabling them to learn from each other. In particular, AG-RAG builds a dense retriever to search for relevant test-assert pairs (TAPs) via semantic matching and a retrieval-augmented generator to synthesize accurate assertions with the focal-test and retrieved TAPs as input. In addition, AG-RAG leverages the code-aware language model CodeT5 as the cornerstone of both the assertion retrieval and generation tasks. Furthermore, AG-RAG designs a joint training strategy that allows the retriever to learn from the feedback provided by the generator. This unified design fully adapts both components toward retrieving more useful TAPs, thereby generating more accurate assertions. AG-RAG is a generic framework that can be adapted to various off-the-shelf PLMs. We extensively evaluate AG-RAG against six state-of-the-art AG approaches on two benchmarks and three metrics. Experimental results show that AG-RAG significantly outperforms previous AG approaches on all benchmarks and metrics, e.g., improving the most recent baseline EditAS by 20.82% and 26.98% in terms of accuracy. AG-RAG also correctly generates 1739 and 2866 unique assertions that all baselines fail to generate, 3.45X and 9.20X more than EditAS. We further demonstrate the positive contribution of our joint training strategy, e.g., AG-RAG improves a variant without the retriever by an average accuracy of 14.11%. Moreover, adopting other PLMs provides substantial improvements, e.g., AG-RAG with four different PLMs improves EditAS by an average accuracy of 9.02%, highlighting the generalizability of our framework. Overall, our work demonstrates the promising potential of jointly fine-tuning the PLM-based retriever and generator to predict accurate assertions by incorporating external knowledge sources, thereby reducing the manual effort of unit testing experts in practical scenarios.
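To make the retrieve-then-generate pipeline and the joint training signal more concrete, the following is a minimal sketch rather than the authors' implementation: it assumes the Hugging Face Salesforce/codet5-base checkpoint, mean-pooled encoder embeddings for dense TAP retrieval, a single CodeT5 instance serving as both retriever encoder and generator for brevity, and a likelihood-weighted marginal loss as the joint objective. The toy TAP corpus, identifiers, and loss formulation are illustrative assumptions, not details taken from the paper.

```python
# Sketch of retrieval-augmented assertion generation with joint training.
# Assumptions: CodeT5-base, mean-pooled encoder embeddings, marginal likelihood loss.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

def embed(texts):
    """Dense embeddings from the CodeT5 encoder (mean pooling over tokens)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model.encoder(**batch).last_hidden_state        # [B, T, H]
    mask = batch["attention_mask"].unsqueeze(-1)              # [B, T, 1]
    return (hidden * mask).sum(1) / mask.sum(1)               # [B, H]

# Hypothetical external codebase of test-assert pairs (TAPs).
corpus = [
    ("testAdd() { Calc c = new Calc(); int r = c.add(1, 2); \"<AssertPlaceHolder>\"; }",
     "assertEquals(3, r)"),
    ("testSize() { List<Integer> l = new ArrayList<>(); \"<AssertPlaceHolder>\"; }",
     "assertEquals(0, l.size())"),
]
focal_test = "testSub() { Calc c = new Calc(); int r = c.sub(5, 2); \"<AssertPlaceHolder>\"; }"
gold_assertion = "assertEquals(3, r)"

# 1) Dense retrieval: score every TAP against the focal-test by inner product.
query_emb = embed([focal_test])                               # [1, H]
tap_embs = embed([test for test, _ in corpus])                # [N, H]
scores = query_emb @ tap_embs.T                               # [1, N]
retrieval_probs = F.softmax(scores, dim=-1)                   # retriever distribution

# 2) Retrieval-augmented generation: concatenate each retrieved TAP with the
#    focal-test and score the gold assertion under the generator.
gen_nlls = []
for tap_test, tap_assert in corpus:
    source = f"{focal_test} </s> {tap_test} {tap_assert}"
    enc_in = tokenizer(source, truncation=True, return_tensors="pt")
    labels = tokenizer(gold_assertion, return_tensors="pt").input_ids
    gen_nlls.append(model(**enc_in, labels=labels).loss)      # mean-token NLL per TAP
gen_nlls = torch.stack(gen_nlls)                              # [N]

# 3) Joint loss: marginalize the generator likelihood (approximated here by
#    exp(-mean NLL)) over the retriever distribution, so gradients flow into
#    both components and the retriever is rewarded for TAPs that make the
#    gold assertion more likely.
joint_loss = -torch.log((retrieval_probs.squeeze(0) * torch.exp(-gen_nlls)).sum())
joint_loss.backward()   # one optimizer step would update retriever and generator together
```

The point the sketch illustrates is step 3: because the generator's likelihood weights the retriever's softmax scores inside a single differentiable objective, backpropagation pushes the retriever toward TAPs that actually help the generator produce the expected assertion, which is the feedback loop that a joint training strategy of this kind relies on.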
Journal Description:
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.