A Universal Data Augmentation Approach for Fault Localization

Huan Xie, Yan Lei, Meng Yan, Yue Yu, Xin Xia, Xiaoguang Mao
{"title":"A Universal Data Augmentation Approach for Fault Localization","authors":"Huan Xie, Yan Lei, Meng Yan, Yue Yu, Xin Xia, Xiaoguang Mao","doi":"10.1145/3510003.3510136","DOIUrl":null,"url":null,"abstract":"Data is the fuel to models, and it is still applicable in fault localization (FL). Many existing elaborate FL techniques take the code coverage matrix and failure vector as inputs, expecting the techniques could find the correlation between program entities and failures. However, the input data is high-dimensional and extremely imbalanced since the real-world programs are large in size and the number of failing test cases is much less than that of passing test cases, which are posing severe threats to the effectiveness of FL techniques. To overcome the limitations, we propose Aeneas, a universal data augmentation approach that generAtes synthesized failing test cases from reduced feature sace for more precise fault localization. Specifically, to improve the effectiveness of data augmentation, Aeneas applies a revised principal component analysis (PCA) first to generate reduced feature space for more concise representation of the original coverage matrix, which could also gain efficiency for data synthesis. Then, Aeneas handles the imbalanced data issue through generating synthesized failing test cases from the reduced feature space through conditional variational autoencoder (CVAE). To evaluate the effectiveness of Aeneas, we conduct large-scale experiments on 458 versions of 10 programs (from ManyBugs, SIR, and Defects4J) by six state-of-the-art FL techniques. The experimental results clearly show that Aeneas is statistically more effective than baselines, e.g., our approach can improve the six original methods by 89% on average under the Top-1 accuracy.","PeriodicalId":202896,"journal":{"name":"2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"17","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3510003.3510136","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 17

Abstract

Data is the fuel to models, and it is still applicable in fault localization (FL). Many existing elaborate FL techniques take the code coverage matrix and failure vector as inputs, expecting the techniques could find the correlation between program entities and failures. However, the input data is high-dimensional and extremely imbalanced since the real-world programs are large in size and the number of failing test cases is much less than that of passing test cases, which are posing severe threats to the effectiveness of FL techniques. To overcome the limitations, we propose Aeneas, a universal data augmentation approach that generAtes synthesized failing test cases from reduced feature sace for more precise fault localization. Specifically, to improve the effectiveness of data augmentation, Aeneas applies a revised principal component analysis (PCA) first to generate reduced feature space for more concise representation of the original coverage matrix, which could also gain efficiency for data synthesis. Then, Aeneas handles the imbalanced data issue through generating synthesized failing test cases from the reduced feature space through conditional variational autoencoder (CVAE). To evaluate the effectiveness of Aeneas, we conduct large-scale experiments on 458 versions of 10 programs (from ManyBugs, SIR, and Defects4J) by six state-of-the-art FL techniques. The experimental results clearly show that Aeneas is statistically more effective than baselines, e.g., our approach can improve the six original methods by 89% on average under the Top-1 accuracy.
一种通用的故障定位数据增强方法
数据是模型的燃料,在故障定位(FL)中仍然适用。现有的许多复杂的FL技术以代码覆盖矩阵和故障向量作为输入,期望能够发现程序实体和故障之间的相关性。然而,由于现实世界的程序规模很大,失败测试用例的数量远远少于通过测试用例的数量,因此输入的数据是高维且极不平衡的,这对FL技术的有效性构成了严重的威胁。为了克服局限性,我们提出了一种通用的数据增强方法Aeneas,该方法可以从减少的特征空间中生成合成的失败测试用例,从而更精确地定位故障。具体而言,为了提高数据增强的有效性,Aeneas首先采用修正的主成分分析(PCA)生成约简的特征空间,以更简洁地表示原始覆盖矩阵,从而提高数据合成的效率。然后,Aeneas通过条件变分自编码器(CVAE)从约简的特征空间生成合成的失败测试用例来处理数据不平衡问题。为了评估Aeneas的有效性,我们使用六种最先进的FL技术对10个程序(来自ManyBugs、SIR和Defects4J)的458个版本进行了大规模实验。实验结果清楚地表明,Aeneas在统计上比基线更有效,例如,在Top-1精度下,我们的方法可以将六种原始方法平均提高89%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信