Class Imbalance Issue in Software Defect Prediction Models by various Machine Learning Techniques: An Empirical Study

2021 8th International Conference on Smart Computing and Communications (ICSCC). Pub Date: 2021-07-01. DOI: 10.1109/ICSCC51209.2021.9528170

Sushant Kumar Pandey, A. Tripathi


Citations: 3

Abstract



Software practitioners continue to build advanced software defect prediction (SDP) models to help testers find fault-prone modules. However, the class imbalance (CI) problem, in which defective instances are far fewer than non-defective ones, causes inconsistent model performance. We conducted 880 experiments to analyze how the performance of 10 SDP models varies under class imbalance. Our experiments used 22 public datasets comprising 41 software metrics, 10 baseline SDP methods, and 4 sampling techniques. We used the Matthews correlation coefficient (MCC), which is more informative when a dataset is highly imbalanced, and compared the predictive performance of various ML models after applying the 4 sampling techniques. We also used the F-measure to examine the performance of the different SDP models. We found that the performance of the learning models is unsatisfactory and needs to be improved. We also found a few surprising results and some logical patterns between classifier and sampling technique. The study provides a connection between sampling technique, software metrics, and classifier.
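To make the two evaluation measures named in the abstract concrete, the following is a minimal sketch that computes the F-measure and MCC from confusion-matrix counts using their standard textbook definitions. It is not the authors' code: the function names and the 100-module toy example (10 defective, 90 non-defective) are assumptions chosen for illustration only.

```python
import math


def f_measure(tp: int, fp: int, fn: int) -> float:
    """F-measure (F1): harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    # By convention here, return 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention here, return 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correctly classified instances."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical toy data: 100 modules, 10 defective (positive class).
# A classifier that labels every module non-defective:
trivial = dict(tp=0, tn=90, fp=0, fn=10)
# A classifier that labels every module correctly:
perfect = dict(tp=10, tn=90, fp=0, fn=0)

print(accuracy(**trivial), f_measure(0, 0, 10), mcc(**trivial))
print(accuracy(**perfect), f_measure(10, 0, 0), mcc(**perfect))
```

In this toy case the trivial classifier reaches an accuracy of 0.9 while both its F-measure and MCC are 0.0, which illustrates why accuracy alone can be misleading on imbalanced defect data.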

