A Universal Data Augmentation Approach for Fault Localization

2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE). Pub Date: 2022-05-01. DOI: 10.1145/3510003.3510136

Huan Xie, Yan Lei, Meng Yan, Yue Yu, Xin Xia, Xiaoguang Mao


Citations: 17

Abstract

Data is the fuel of models, and this holds for fault localization (FL) as well. Many existing FL techniques take the code coverage matrix and the failure vector as inputs, expecting to find the correlation between program entities and failures. However, the input data is high-dimensional and extremely imbalanced: real-world programs are large, and failing test cases are far fewer than passing test cases. Both properties pose severe threats to the effectiveness of FL techniques. To overcome these limitations, we propose Aeneas, a universal data augmentation approach that generAtes synthesized failing test cases from reduced feature space for more precise fault localization. Specifically, to improve the effectiveness of data augmentation, Aeneas first applies a revised principal component analysis (PCA) to generate a reduced feature space that represents the original coverage matrix more concisely, which also makes data synthesis more efficient. Then, Aeneas handles the data imbalance by generating synthesized failing test cases from the reduced feature space with a conditional variational autoencoder (CVAE). To evaluate the effectiveness of Aeneas, we conduct large-scale experiments on 458 versions of 10 programs (from ManyBugs, SIR, and Defects4J) with six state-of-the-art FL techniques. The experimental results show that Aeneas is statistically more effective than the baselines; for example, our approach improves the six original methods by 89% on average in Top-1 accuracy.
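To make the inputs concrete, the following is a minimal sketch of a coverage matrix and failure vector, together with one way an FL technique might correlate statements with failures. The toy data, the function names, and the choice of the Ochiai formula are assumptions for illustration only; the abstract does not name the six FL techniques it evaluates, and this sketch does not implement Aeneas's revised PCA or CVAE steps.

```python
import math

# Hypothetical toy data: rows are test cases, columns are program statements.
# coverage[i][j] == 1 means test i executed statement j.
coverage = [
    [1, 1, 0, 1],  # passing test
    [1, 0, 0, 1],  # passing test
    [1, 1, 0, 0],  # passing test
    [1, 0, 1, 0],  # passing test
    [1, 0, 1, 1],  # failing test
]
# failures[i] == 1 means test i failed (the failure vector).
failures = [0, 0, 0, 0, 1]


def imbalance_ratio(failures):
    """Return the number of passing tests per failing test."""
    failing = sum(failures)
    passing = len(failures) - failing
    return passing / failing if failing else math.inf


def ochiai(coverage, failures):
    """Score each statement with the Ochiai suspiciousness formula."""
    total_failed = sum(failures)
    scores = []
    for j in range(len(coverage[0])):
        # ef / ep: failing / passing tests that executed statement j.
        ef = sum(1 for row, f in zip(coverage, failures) if row[j] and f)
        ep = sum(1 for row, f in zip(coverage, failures) if row[j] and not f)
        denom = math.sqrt(total_failed * (ef + ep))
        scores.append(ef / denom if denom else 0.0)
    return scores


scores = ochiai(coverage, failures)
# Statement indices ordered from most to least suspicious.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print("passing per failing test:", imbalance_ratio(failures))
print("ranking:", ranking)
```

With a single failing test, every score rests on one row of the matrix, which is the imbalance the abstract describes. Synthesized failing test cases would add rows with `failures[i] == 1` before the scoring step.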
