Assessing Evaluation Metrics for Neural Test Oracle Generation

IF 6.5 | Zone 1 (Computer Science) | JCR Q1, COMPUTER SCIENCE, SOFTWARE ENGINEERING

IEEE Transactions on Software Engineering | Pub Date: 2024-07-25 | DOI: 10.1109/TSE.2024.3433463

Jiho Shin; Hadi Hemmati; Moshi Wei; Song Wang

Volume 50, Issue 9, pages 2337-2349. Journal Article. Source: https://ieeexplore.ieee.org/document/10609742/

Citations: 0

Abstract

Recently, deep learning models have shown promising results in test oracle generation. Neural Oracle Generation (NOG) models are commonly evaluated using static (automatic) metrics that are mainly based on the textual similarity of the output, e.g., BLEU, ROUGE-L, METEOR, and Accuracy. However, these textual similarity metrics may not reflect the testing effectiveness of the generated oracle within a test suite, which is often measured by dynamic (execution-based) test adequacy metrics such as code coverage and mutation score. In this work, we revisit existing oracle generation studies plus gpt-3.5 to empirically investigate how they currently perform on textual similarity and test adequacy metrics. Specifically, we train and run four state-of-the-art test oracle generation models and evaluate them on seven textual similarity metrics and two test adequacy metrics. We apply two different correlation analyses between these two sets of metrics. Surprisingly, we found no significant correlation between the textual similarity metrics and the test adequacy metrics. For instance, gpt-3.5 on the jackrabbit-oak project had the highest performance on all seven textual similarity metrics among the studied NOGs, yet it had the lowest test adequacy metrics of all the studied NOGs. We further conducted a qualitative analysis to explore the reasons behind our observations. We found that oracles with high textual similarity metrics but low test adequacy metrics tend to have complex or multiple chained method invocations within the oracle's parameters, which makes them hard for the model to generate completely and in turn affects the test adequacy metrics. On the other hand, oracles with low textual similarity metrics but high test adequacy metrics tend to call different assertion types or a different method that functions similarly to those in the ground truth. Overall, this work complements prior studies on test oracle generation with an extensive performance evaluation on textual similarity and test adequacy metrics and provides guidelines for better assessment of deep learning applications in software test generation in the future.

Source journal

IEEE Transactions on Software Engineering (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore

9.70

Self-citation rate

10.80%

Articles published

724

Review time

6 months

About the journal: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include: a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models. b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects. c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards. d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues. e) System issues: Hardware-software trade-offs. f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.