Fuyang Li, Wanpeng Lu, Jacky Wai Keung, Xiao Yu, Lina Gong, Juan Li
{"title":"特征选择技术对努力感知缺陷预测的影响:一项实证研究","authors":"Fuyang Li, Wanpeng Lu, Jacky Wai Keung, Xiao Yu, Lina Gong, Juan Li","doi":"10.1049/sfw2.12099","DOIUrl":null,"url":null,"abstract":"<p>Effort-Aware Defect Prediction (EADP) methods sort software modules based on the defect density and guide the testing team to inspect the modules with high defect density first. Previous studies indicated that some feature selection methods could improve the performance of Classification-Based Defect Prediction (CBDP) models, and the Correlation-based feature subset selection method with the Best First strategy (CorBF) performed the best. However, the practical benefits of feature selection methods on EADP performance are still unknown, and blindly employing the best-performing CorBF method in CBDP to pre-process the defect datasets may not improve the performance of EADP models but possibly result in performance degradation. To assess the impact of the feature selection techniques on EADP, a total of 24 feature selection methods with 10 classifiers embedded in a state-of-the-art EADP model (CBS+) on the 41 PROMISE defect datasets were examined. We employ six evaluation metrics to assess the performance of EADP models comprehensively. The results show that (1) The impact of the feature selection methods varies in classifiers and datasets. (2) The four wrapper-based feature subset selection methods with forwards search, that is, AdaBoost with Forwards Search, Deep Forest with Forwards Search, Random Forest with Forwards Search, and XGBoost with Forwards Search (XGBF) are better than other methods across the studied classifiers and the used datasets. And XGBF with XGBoost as the embedded classifier in CBS+ performs the best on the datasets. (3) The best-performing CorBF method in CBDP does not perform well on the EADP task. (4) The selected features vary with different feature selection methods and different datasets, and the features <i>noc</i> (number of children), <i>ic</i> (inheritance coupling), <i>cbo</i> (coupling between object classes), and <i>cbm</i> (coupling between methods) are frequently selected by the four wrapper-based feature subset selection methods with forwards search. (5) Using AdaBoost, deep forest, random forest, and XGBoost as the base classifiers embedded in CBS+ can achieve the best performance. In summary, we recommend the software testing team should employ XGBF with XGBoost as the embedded classifier in CBS+ to enhance the EADP performance.</p>","PeriodicalId":50378,"journal":{"name":"IET Software","volume":"17 2","pages":"168-193"},"PeriodicalIF":1.5000,"publicationDate":"2023-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/sfw2.12099","citationCount":"9","resultStr":"{\"title\":\"The impact of feature selection techniques on effort-aware defect prediction: An empirical study\",\"authors\":\"Fuyang Li, Wanpeng Lu, Jacky Wai Keung, Xiao Yu, Lina Gong, Juan Li\",\"doi\":\"10.1049/sfw2.12099\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Effort-Aware Defect Prediction (EADP) methods sort software modules based on the defect density and guide the testing team to inspect the modules with high defect density first. Previous studies indicated that some feature selection methods could improve the performance of Classification-Based Defect Prediction (CBDP) models, and the Correlation-based feature subset selection method with the Best First strategy (CorBF) performed the best. However, the practical benefits of feature selection methods on EADP performance are still unknown, and blindly employing the best-performing CorBF method in CBDP to pre-process the defect datasets may not improve the performance of EADP models but possibly result in performance degradation. To assess the impact of the feature selection techniques on EADP, a total of 24 feature selection methods with 10 classifiers embedded in a state-of-the-art EADP model (CBS+) on the 41 PROMISE defect datasets were examined. We employ six evaluation metrics to assess the performance of EADP models comprehensively. The results show that (1) The impact of the feature selection methods varies in classifiers and datasets. (2) The four wrapper-based feature subset selection methods with forwards search, that is, AdaBoost with Forwards Search, Deep Forest with Forwards Search, Random Forest with Forwards Search, and XGBoost with Forwards Search (XGBF) are better than other methods across the studied classifiers and the used datasets. And XGBF with XGBoost as the embedded classifier in CBS+ performs the best on the datasets. (3) The best-performing CorBF method in CBDP does not perform well on the EADP task. (4) The selected features vary with different feature selection methods and different datasets, and the features <i>noc</i> (number of children), <i>ic</i> (inheritance coupling), <i>cbo</i> (coupling between object classes), and <i>cbm</i> (coupling between methods) are frequently selected by the four wrapper-based feature subset selection methods with forwards search. (5) Using AdaBoost, deep forest, random forest, and XGBoost as the base classifiers embedded in CBS+ can achieve the best performance. In summary, we recommend the software testing team should employ XGBF with XGBoost as the embedded classifier in CBS+ to enhance the EADP performance.</p>\",\"PeriodicalId\":50378,\"journal\":{\"name\":\"IET Software\",\"volume\":\"17 2\",\"pages\":\"168-193\"},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2023-02-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1049/sfw2.12099\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IET Software\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1049/sfw2.12099\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IET Software","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/sfw2.12099","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 9
摘要
Effort Aware Defect Prediction(EADP)方法根据缺陷密度对软件模块进行排序,并引导测试团队首先检查缺陷密度高的模块。先前的研究表明,一些特征选择方法可以提高基于分类的缺陷预测(CBDP)模型的性能,而基于相关性的特征子集选择方法和最佳优先策略(CorBF)表现最好。然而,特征选择方法对EADP性能的实际好处仍然未知,在CBDP中盲目使用性能最好的CorBF方法来预处理缺陷数据集可能不会提高EADP模型的性能,但可能导致性能下降。为了评估特征选择技术对EADP的影响,在41个PROMISE缺陷数据集上检查了总共24种特征选择方法,其中10个分类器嵌入在最先进的EADP模型(CBS+)中。我们采用了六个评估指标来全面评估EADP模型的性能。结果表明:(1)特征选择方法对分类器和数据集的影响各不相同。(2) 在所研究的分类器和所使用的数据集中,四种基于包装器的前向搜索特征子集选择方法,即AdaBoost with forwards search、Deep Forest with Forward search、Random Forest with forwards search和XGBoost with forward search(XGBF),都优于其他方法。以XGBoost作为CBS+中嵌入分类器的XGBF在数据集上表现最好。(3) CBDP中性能最好的CorBF方法在EADP任务中表现不佳。(4) 所选择的特征随着不同的特征选择方法和不同的数据集而变化,并且基于前向搜索的四种基于包装器的特征子集选择方法经常选择特征noc(子数)、ic(继承耦合)、cbo(对象类之间的耦合)和cbm(方法之间的耦合。(5) 使用AdaBoost、深层森林、随机森林和XGBoost作为嵌入CBS+的基础分类器可以获得最佳性能。总之,我们建议软件测试团队使用XGBF和XGBoost作为CBS+中的嵌入式分类器,以提高EADP性能。
The impact of feature selection techniques on effort-aware defect prediction: An empirical study
Effort-Aware Defect Prediction (EADP) methods sort software modules based on the defect density and guide the testing team to inspect the modules with high defect density first. Previous studies indicated that some feature selection methods could improve the performance of Classification-Based Defect Prediction (CBDP) models, and the Correlation-based feature subset selection method with the Best First strategy (CorBF) performed the best. However, the practical benefits of feature selection methods on EADP performance are still unknown, and blindly employing the best-performing CorBF method in CBDP to pre-process the defect datasets may not improve the performance of EADP models but possibly result in performance degradation. To assess the impact of the feature selection techniques on EADP, a total of 24 feature selection methods with 10 classifiers embedded in a state-of-the-art EADP model (CBS+) on the 41 PROMISE defect datasets were examined. We employ six evaluation metrics to assess the performance of EADP models comprehensively. The results show that (1) The impact of the feature selection methods varies in classifiers and datasets. (2) The four wrapper-based feature subset selection methods with forwards search, that is, AdaBoost with Forwards Search, Deep Forest with Forwards Search, Random Forest with Forwards Search, and XGBoost with Forwards Search (XGBF) are better than other methods across the studied classifiers and the used datasets. And XGBF with XGBoost as the embedded classifier in CBS+ performs the best on the datasets. (3) The best-performing CorBF method in CBDP does not perform well on the EADP task. (4) The selected features vary with different feature selection methods and different datasets, and the features noc (number of children), ic (inheritance coupling), cbo (coupling between object classes), and cbm (coupling between methods) are frequently selected by the four wrapper-based feature subset selection methods with forwards search. (5) Using AdaBoost, deep forest, random forest, and XGBoost as the base classifiers embedded in CBS+ can achieve the best performance. In summary, we recommend the software testing team should employ XGBF with XGBoost as the embedded classifier in CBS+ to enhance the EADP performance.
期刊介绍:
IET Software publishes papers on all aspects of the software lifecycle, including design, development, implementation and maintenance. The focus of the journal is on the methods used to develop and maintain software, and their practical application.
Authors are especially encouraged to submit papers on the following topics, although papers on all aspects of software engineering are welcome:
Software and systems requirements engineering
Formal methods, design methods, practice and experience
Software architecture, aspect and object orientation, reuse and re-engineering
Testing, verification and validation techniques
Software dependability and measurement
Human systems engineering and human-computer interaction
Knowledge engineering; expert and knowledge-based systems, intelligent agents
Information systems engineering
Application of software engineering in industry and commerce
Software engineering technology transfer
Management of software development
Theoretical aspects of software development
Machine learning
Big data and big code
Cloud computing
Current Special Issue. Call for papers:
Knowledge Discovery for Software Development - https://digital-library.theiet.org/files/IET_SEN_CFP_KDSD.pdf
Big Data Analytics for Sustainable Software Development - https://digital-library.theiet.org/files/IET_SEN_CFP_BDASSD.pdf