{"title":"OneMoreTest: A Learning-Based Approach to Generating and Selecting Fault-Revealing Unit Tests","authors":"Wei Wei;Yanjie Jiang;Yahui Li;Lu Zhang;Hui Liu","doi":"10.1109/TSE.2025.3581556","DOIUrl":null,"url":null,"abstract":"Developers often manually design a few unit tests for a given method under development. After passing such manually designed tests, however, they usually have to turn to automated test case generation tools like EvoSuite and Randoop for more thorough testing. Although the automatically generated tests may achieve a high coverage, they rarely identify hard-to-detect defects automatically because of the well-known test oracle problem: It is challenging to tell whether the output is correct or incorrect without explicit test oracle (expected output). Consequently, developers should manually select and verify a few suspicious test cases to identify hard-to-detect defects. To this end, in this paper, we propose a novel approach, called <i>OneMoreTest</i>, to generating and selecting the most suspicious tests for manual verification. Based on a manually designed passed test, <i>OneMoreTest</i> automatically generates millions of input-output pairs for the method under test (MUT) with mutation-based fuzzing. It then trains an automatically generated neural network to simulate the MUT’s behavior. For new tests automatically generated for the same MUT, <i>OneMoreTest</i> suggests developers with the top <inline-formula><tex-math>$k$</tex-math></inline-formula> most suspicious tests that have the greatest distances between their actual output and estimated output (i.e., network’s output). Our evaluation on real-world faulty methods suggests that <i>OneMoreTest</i> is accurate. On 70.79% of the involved 178 real-world faulty methods, we can identify the defects by manually verifying only a SINGLE test for each of the methods according to <i>OneMoreTest</i>’s suggestions. Compared against the state of the art, <i>OneMoreTest</i> improved the precision from 46.63% to 72.62%, and recall from 46.63% to 70.79%.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 8","pages":"2346-2365"},"PeriodicalIF":5.6000,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11049955/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Abstract
Developers often manually design a few unit tests for a given method under development. After such manually designed tests pass, however, they usually have to turn to automated test case generation tools like EvoSuite and Randoop for more thorough testing. Although the automatically generated tests may achieve high coverage, they rarely identify hard-to-detect defects automatically because of the well-known test oracle problem: it is challenging to tell whether an output is correct or incorrect without an explicit test oracle (expected output). Consequently, developers must manually select and verify a few suspicious test cases to identify hard-to-detect defects. To this end, in this paper we propose a novel approach, called OneMoreTest, to generating and selecting the most suspicious tests for manual verification. Based on a manually designed passing test, OneMoreTest automatically generates millions of input-output pairs for the method under test (MUT) with mutation-based fuzzing. It then trains an automatically generated neural network to simulate the MUT's behavior. For new tests automatically generated for the same MUT, OneMoreTest presents developers with the top $k$ most suspicious tests, i.e., those with the greatest distances between their actual output and the estimated output (the network's output). Our evaluation on real-world faulty methods suggests that OneMoreTest is accurate. On 70.79% of the 178 involved real-world faulty methods, the defects can be identified by manually verifying only a SINGLE test per method, following OneMoreTest's suggestions. Compared against the state of the art, OneMoreTest improved precision from 46.63% to 72.62%, and recall from 46.63% to 70.79%.
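The selection step described in the abstract can be illustrated with a minimal sketch: a surrogate network trained on fuzzed input-output pairs predicts the MUT's output, and tests whose actual output deviates most from the prediction are ranked as most suspicious. This is an illustrative assumption of how such ranking might look, not the paper's implementation; the function names (rank_suspicious, mut, surrogate) and the use of Euclidean distance are hypothetical.

```python
import numpy as np

def rank_suspicious(tests, mut, surrogate, k=1):
    """Return the top-k test inputs whose actual MUT output is farthest
    from the surrogate network's estimated output.

    tests:     iterable of test inputs
    mut:       callable running the method under test (hypothetical)
    surrogate: callable approximating the MUT, e.g. a trained network (hypothetical)
    """
    scored = []
    for inputs in tests:
        actual = np.asarray(mut(inputs), dtype=float)          # real output of the MUT
        estimated = np.asarray(surrogate(inputs), dtype=float) # network's simulated output
        distance = np.linalg.norm(actual - estimated)          # suspiciousness score
        scored.append((distance, inputs))
    # Largest distance first: these outputs the surrogate could not predict,
    # so they are the most likely to expose a defect.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [inputs for _, inputs in scored[:k]]
```

Under this reading, a developer would manually verify only the top-ranked test(s); per the paper's evaluation, a single top-ranked test sufficed to reveal the defect on 70.79% of the studied faulty methods.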
Journal Introduction
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.