{"title":"Automated Test Case Repair Using Language Models","authors":"Ahmadreza Saboor Yaraghi;Darren Holden;Nafiseh Kahani;Lionel Briand","doi":"10.1109/TSE.2025.3541166","DOIUrl":null,"url":null,"abstract":"Ensuring the quality of software systems through testing is essential, yet maintaining test cases poses significant challenges and costs. The need for frequent updates to align with the evolving system under test often entails high complexity and cost for maintaining these test cases. Further, unrepaired broken test cases can degrade test suite quality and disrupt the software development process, wasting developers’ time. To address this challenge, we present <sc>TaRGET</small> (<sc>Test Repair GEneraTor</small>), a novel approach leveraging pre-trained code language models for automated test case repair. <sc>TaRGET</small> treats test repair as a language translation task, employing a two-step process to fine-tune a language model based on essential context data characterizing the test breakage. To evaluate our approach, we introduce <sc>TaRBench</small>, a comprehensive benchmark we developed covering 45,373 broken test repairs across 59 open-source projects. Our results demonstrate <sc>TaRGET</small>'s effectiveness, achieving a 66.1% exact match accuracy. Furthermore, our study examines the effectiveness of <sc>TaRGET</small> across different test repair scenarios. We provide a practical guide to predict situations where the generated test repairs might be less reliable. We also explore whether project-specific data is always necessary for fine-tuning and if our approach can be effective on new projects.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 4","pages":"1104-1133"},"PeriodicalIF":6.5000,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10883022/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0
Abstract
Ensuring the quality of software systems through testing is essential, yet maintaining test cases is challenging and costly: they must be updated frequently to stay aligned with the evolving system under test. Further, unrepaired broken test cases degrade test suite quality and disrupt the software development process, wasting developers' time. To address this challenge, we present TaRGET (Test Repair GEneraTor), a novel approach that leverages pre-trained code language models for automated test case repair. TaRGET treats test repair as a language translation task, employing a two-step process to fine-tune a language model on essential context data characterizing the test breakage. To evaluate our approach, we introduce TaRBench, a comprehensive benchmark we developed covering 45,373 broken test repairs across 59 open-source projects. Our results demonstrate TaRGET's effectiveness, achieving 66.1% exact match accuracy. Furthermore, our study examines the effectiveness of TaRGET across different test repair scenarios, and we provide a practical guide for predicting situations where the generated test repairs may be less reliable. We also explore whether project-specific data is always necessary for fine-tuning and whether our approach can be effective on new projects.
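To make the translation framing concrete, below is a minimal JUnit 4 sketch of the kind of input/output pair such a repair model operates on. The Cart, Item, and Currency names are invented for illustration and do not come from the paper or TaRBench; the breakage scenario, a method in the system under test gaining a parameter, is simply one common cause of broken tests.

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class CartTest {
    // Invented SUT stubs (not from the paper), inlined so the sketch is self-contained.
    enum Currency { USD }

    static class Item {
        final String name;
        final int price;
        Item(String name, int price) { this.name = name; this.price = price; }
    }

    static class Cart {
        private int sum = 0;
        void add(Item item) { sum += item.price; }
        int total(Currency currency) { return sum; } // evolved from: int total()
    }

    // Broken test (model input) -- written against the old API, no longer compiles:
    //     assertEquals(10, cart.total());
    //
    // Repaired test (model output) -- updated to the new API:
    @Test
    public void testTotal() {
        Cart cart = new Cart();
        cart.add(new Item("book", 10));
        assertEquals(10, cart.total(Currency.USD));
    }
}

In this framing, the broken test together with context characterizing the SUT change plays the role of the source sequence, and the repaired test is the target sequence that the fine-tuned model generates.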
Journal description:
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.