On the use of evaluation measures for defect prediction studies

Rebecca Moussa, Federica Sarro
{"title":"On the use of evaluation measures for defect prediction studies","authors":"Rebecca Moussa, Federica Sarro","doi":"10.1145/3533767.3534405","DOIUrl":null,"url":null,"abstract":"Software defect prediction research has adopted various evaluation measures to assess the performance of prediction models. In this paper, we further stress on the importance of the choice of appropriate measures in order to correctly assess strengths and weaknesses of a given defect prediction model, especially given that most of the defect prediction tasks suffer from data imbalance. Investigating 111 previous studies published between 2010 and 2020, we found out that over a half either use only one evaluation measure, which alone cannot express all the characteristics of model performance in presence of imbalanced data, or a set of binary measures which are prone to be biased when used to assess models especially when trained with imbalanced data. We also unveil the magnitude of the impact of assessing popular defect prediction models with several evaluation measures based, for the first time, on both statistical significance test and effect size analyses. Our results reveal that the evaluation measures produce a different ranking of the classification models in 82% and 85% of the cases studied according to the Wilcoxon statistical significance test and Â12 effect size, respectively. Further, we observe a very high rank disruption (between 64% to 92% on average) for each of the measures investigated. This signifies that, in the majority of the cases, a prediction technique that would be believed to be better than others when using a given evaluation measure becomes worse when using a different one. We conclude by providing some recommendations for the selection of appropriate evaluation measures based on factors which are specific to the problem at hand such as the class distribution of the training data, the way in which the model has been built and will be used. Moreover, we recommend to include in the set of evaluation measures, at least one able to capture the full picture of the confusion matrix, such as MCC. This will enable researchers to assess whether proposals made in previous work can be applied for purposes different than the ones they were originally intended for. Besides, we recommend to report, whenever possible, the raw confusion matrix to allow other researchers to compute any measure of interest thereby making it feasible to draw meaningful observations across different studies.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3533767.3534405","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Software defect prediction research has adopted various evaluation measures to assess the performance of prediction models. In this paper, we further stress the importance of choosing appropriate measures in order to correctly assess the strengths and weaknesses of a given defect prediction model, especially given that most defect prediction tasks suffer from data imbalance. Investigating 111 previous studies published between 2010 and 2020, we found that over half either use only one evaluation measure, which alone cannot express all the characteristics of model performance in the presence of imbalanced data, or a set of binary measures which are prone to bias when used to assess models, especially those trained with imbalanced data. We also unveil the magnitude of the impact of assessing popular defect prediction models with several evaluation measures based, for the first time, on both statistical significance tests and effect size analyses. Our results reveal that the evaluation measures produce a different ranking of the classification models in 82% and 85% of the cases studied, according to the Wilcoxon statistical significance test and the Â12 effect size, respectively. Further, we observe a very high rank disruption (between 64% and 92% on average) for each of the measures investigated. This signifies that, in the majority of the cases, a prediction technique that would be believed to be better than others when using a given evaluation measure becomes worse when using a different one. We conclude by providing recommendations for the selection of appropriate evaluation measures based on factors specific to the problem at hand, such as the class distribution of the training data and the way in which the model has been built and will be used. Moreover, we recommend including in the set of evaluation measures at least one able to capture the full picture of the confusion matrix, such as MCC. This will enable researchers to assess whether proposals made in previous work can be applied for purposes different from the ones they were originally intended for. In addition, we recommend reporting, whenever possible, the raw confusion matrix, to allow other researchers to compute any measure of interest, thereby making it feasible to draw meaningful observations across different studies.
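To make the recommendations above concrete, the sketch below (in Python; it is not taken from the paper, and the function names, the SciPy dependency, and the example numbers are illustrative assumptions) shows how MCC and other measures can be derived from a raw confusion matrix, and how two models' scores might be compared with the Wilcoxon signed-rank test and the Vargha-Delaney Â12 effect size.

# Illustrative sketch only: deriving evaluation measures from raw confusion
# matrix counts, and comparing two models with Wilcoxon and Â12.
import math
from scipy.stats import wilcoxon

def measures_from_confusion_matrix(tp, fp, fn, tn):
    # Precision, recall and F1 use only part of the confusion matrix.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC uses all four cells, which is why it captures the full picture
    # of the confusion matrix on imbalanced data.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

def a12(scores_a, scores_b):
    # Vargha-Delaney Â12: probability that a score from A exceeds one from B.
    greater = sum(1 for x in scores_a for y in scores_b if x > y)
    equal = sum(1 for x in scores_a for y in scores_b if x == y)
    return (greater + 0.5 * equal) / (len(scores_a) * len(scores_b))

# Hypothetical per-fold MCC scores of two classifiers on the same splits.
model_a = [0.42, 0.39, 0.47, 0.44, 0.40]
model_b = [0.35, 0.41, 0.38, 0.36, 0.37]
stat, p_value = wilcoxon(model_a, model_b)  # paired significance test
print(measures_from_confusion_matrix(tp=50, fp=20, fn=10, tn=420))
print(f"Wilcoxon p={p_value:.3f}, A12={a12(model_a, model_b):.2f}")

Reporting the four raw counts (tp, fp, fn, tn), as the paper recommends, is enough for any reader to recompute all of the measures above, or any other measure of interest, without rerunning the experiments.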