{"title":"OneMoreTest: A Learning-Based Approach to Generating and Selecting Fault-Revealing Unit Tests","authors":"Wei Wei;Yanjie Jiang;Yahui Li;Lu Zhang;Hui Liu","doi":"10.1109/TSE.2025.3581556","DOIUrl":null,"url":null,"abstract":"Developers often manually design a few unit tests for a given method under development. After passing such manually designed tests, however, they usually have to turn to automated test case generation tools like EvoSuite and Randoop for more thorough testing. Although the automatically generated tests may achieve a high coverage, they rarely identify hard-to-detect defects automatically because of the well-known test oracle problem: It is challenging to tell whether the output is correct or incorrect without explicit test oracle (expected output). Consequently, developers should manually select and verify a few suspicious test cases to identify hard-to-detect defects. To this end, in this paper, we propose a novel approach, called <i>OneMoreTest</i>, to generating and selecting the most suspicious tests for manual verification. Based on a manually designed passed test, <i>OneMoreTest</i> automatically generates millions of input-output pairs for the method under test (MUT) with mutation-based fuzzing. It then trains an automatically generated neural network to simulate the MUT’s behavior. For new tests automatically generated for the same MUT, <i>OneMoreTest</i> suggests developers with the top <inline-formula><tex-math>$k$</tex-math></inline-formula> most suspicious tests that have the greatest distances between their actual output and estimated output (i.e., network’s output). Our evaluation on real-world faulty methods suggests that <i>OneMoreTest</i> is accurate. On 70.79% of the involved 178 real-world faulty methods, we can identify the defects by manually verifying only a SINGLE test for each of the methods according to <i>OneMoreTest</i>’s suggestions. Compared against the state of the art, <i>OneMoreTest</i> improved the precision from 46.63% to 72.62%, and recall from 46.63% to 70.79%.","PeriodicalId":13324,"journal":{"name":"IEEE Transactions on Software Engineering","volume":"51 8","pages":"2346-2365"},"PeriodicalIF":5.6000,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/11049955/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Abstract
Developers often manually design a few unit tests for a given method under development. After such manually designed tests pass, however, they usually have to turn to automated test case generation tools like EvoSuite and Randoop for more thorough testing. Although the automatically generated tests may achieve high coverage, they rarely identify hard-to-detect defects automatically because of the well-known test oracle problem: it is challenging to tell whether an output is correct or incorrect without an explicit test oracle (expected output). Consequently, developers must manually select and verify a few suspicious test cases to identify hard-to-detect defects. To this end, in this paper we propose a novel approach, called OneMoreTest, to generating and selecting the most suspicious tests for manual verification. Based on a manually designed passing test, OneMoreTest automatically generates millions of input-output pairs for the method under test (MUT) with mutation-based fuzzing. It then trains an automatically generated neural network to simulate the MUT's behavior. For new tests automatically generated for the same MUT, OneMoreTest presents developers with the top $k$ most suspicious tests, i.e., those with the greatest distances between their actual output and the estimated output (the network's output). Our evaluation on real-world faulty methods suggests that OneMoreTest is accurate. On 70.79% of the 178 involved real-world faulty methods, the defects can be identified by manually verifying only a SINGLE test per method, following OneMoreTest's suggestions. Compared against the state of the art, OneMoreTest improved precision from 46.63% to 72.62%, and recall from 46.63% to 70.79%.
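The selection step described in the abstract can be illustrated with a minimal sketch: a surrogate network trained on fuzzed input-output pairs predicts the MUT's output, and tests whose actual output deviates most from the prediction are ranked as most suspicious. This is an illustrative assumption of how such ranking might look, not the paper's implementation; the function names (rank_suspicious, mut, surrogate) and the use of Euclidean distance are hypothetical.

```python
import numpy as np

def rank_suspicious(tests, mut, surrogate, k=1):
    """Return the top-k test inputs whose actual MUT output is farthest
    from the surrogate network's estimated output.

    tests:     iterable of test inputs
    mut:       callable running the method under test (hypothetical)
    surrogate: callable approximating the MUT, e.g. a trained network (hypothetical)
    """
    scored = []
    for inputs in tests:
        actual = np.asarray(mut(inputs), dtype=float)          # real output of the MUT
        estimated = np.asarray(surrogate(inputs), dtype=float) # network's simulated output
        distance = np.linalg.norm(actual - estimated)          # suspiciousness score
        scored.append((distance, inputs))
    # Largest distance first: these outputs the surrogate could not predict,
    # so they are the most likely to expose a defect.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [inputs for _, inputs in scored[:k]]
```

Under this reading, a developer would manually verify only the top-ranked test(s); per the paper's evaluation, a single top-ranked test sufficed to reveal the defect on 70.79% of the studied faulty methods.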
Journal Introduction
IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.