The Significant Effects of Data Sampling Approaches on Software Defect Prioritization and Classification

K. E. Bennin, J. Keung, Akito Monden, Passakorn Phannachitta, Solomon Mensah
{"title":"The Significant Effects of Data Sampling Approaches on Software Defect Prioritization and Classification","authors":"K. E. Bennin, J. Keung, Akito Monden, Passakorn Phannachitta, Solomon Mensah","doi":"10.1109/ESEM.2017.50","DOIUrl":null,"url":null,"abstract":"Context: Recent studies have shown that performance of defect prediction models can be affected when data sampling approaches are applied to imbalanced training data for building defect prediction models. However, the magnitude (degree and power) of the effect of these sampling methods on the classification and prioritization performances of defect prediction models is still unknown. Goal: To investigate the statistical and practical significance of using resampled data for constructing defect prediction models. Method: We examine the practical effects of six data sampling methods on performances of five defect prediction models. The prediction performances of the models trained on default datasets (no sampling method) are compared with that of the models trained on resampled datasets (application of sampling methods). To decide whether the performance changes are significant or not, robust statistical tests are performed and effect sizes computed. Twenty releases of ten open source projects extracted from the PROMISE repository are considered and evaluated using the AUC, pd, pf and G-mean performance measures. Results: There are statistical significant differences and practical effects on the classification performance (pd, pf and G-mean) between models trained on resampled datasets and those trained on the default datasets. However, sampling methods have no statistical and practical effects on defect prioritization performance (AUC) with small or no effect values obtained from the models trained on the resampled datasets. Conclusions: Existing sampling methods can properly set the threshold between buggy and clean samples, while they cannot improve the prediction of defect-proneness itself. Sampling methods are highly recommended for defect classification purposes when all faulty modules are to be considered for testing.","PeriodicalId":213866,"journal":{"name":"2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"32","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ESEM.2017.50","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 32

Abstract

Context: Recent studies have shown that the performance of defect prediction models can be affected when data sampling approaches are applied to the imbalanced training data used to build them. However, the magnitude (degree and power) of the effect of these sampling methods on the classification and prioritization performance of defect prediction models is still unknown.

Goal: To investigate the statistical and practical significance of using resampled data to construct defect prediction models.

Method: We examine the practical effects of six data sampling methods on the performance of five defect prediction models. The prediction performance of models trained on the default datasets (no sampling method) is compared with that of models trained on resampled datasets (sampling methods applied). To decide whether the performance changes are significant, robust statistical tests are performed and effect sizes computed. Twenty releases of ten open source projects extracted from the PROMISE repository are considered and evaluated using the AUC, pd, pf and G-mean performance measures.

Results: There are statistically significant differences and practical effects on classification performance (pd, pf and G-mean) between models trained on resampled datasets and those trained on the default datasets. However, the sampling methods have no statistically or practically significant effect on defect prioritization performance (AUC), with small or negligible effect sizes obtained from the models trained on the resampled datasets.

Conclusions: Existing sampling methods can properly set the threshold between buggy and clean samples, but they cannot improve the prediction of defect-proneness itself. Sampling methods are highly recommended for defect classification purposes when all faulty modules are to be considered for testing.
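The pd, pf and G-mean measures named above are confusion-matrix based, while AUC depends only on how the model ranks modules. The sketch below is not the authors' code: the logistic-regression classifier, the random-oversampling stand-in for the six unnamed sampling methods, and the synthetic dataset are all assumptions. It illustrates the kind of default-versus-resampled comparison the abstract describes, using the definitions common in the defect-prediction literature: pd = TP/(TP+FN), pf = FP/(FP+TN), G-mean = sqrt(pd * (1 - pf)).

```python
# Minimal sketch (assumptions noted above) of comparing a model trained on
# default (imbalanced) data with one trained on resampled data, scored with
# pd, pf, G-mean (classification) and AUC (prioritization).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

def classification_scores(y_true, y_pred):
    """pd, pf and G-mean as commonly defined in defect-prediction studies."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    pd = tp / (tp + fn)           # probability of detection (recall on buggy)
    pf = fp / (fp + tn)           # probability of false alarm
    g_mean = np.sqrt(pd * (1 - pf))
    return pd, pf, g_mean

def random_oversample(X, y, rng):
    """Duplicate minority-class rows until both classes have equal size."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

# Imbalanced stand-in for a defect dataset (~15% buggy modules).
X, y = make_classification(n_samples=2000, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rng = np.random.default_rng(0)
for name, (Xs, ys) in {
    "default":   (X_tr, y_tr),
    "resampled": random_oversample(X_tr, y_tr, rng),
}.items():
    model = LogisticRegression(max_iter=1000).fit(Xs, ys)
    pd_, pf_, gm = classification_scores(y_te, model.predict(X_te))
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name:9s}  pd={pd_:.2f}  pf={pf_:.2f}  G-mean={gm:.2f}  AUC={auc:.2f}")
```

Because oversampling mainly shifts the classifier's decision threshold rather than its ranking of modules, pd, pf and G-mean move while AUC typically changes little, which mirrors the paper's conclusion that sampling helps classification but not prioritization.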