A novel instance-based method for cross-project just-in-time defect prediction
Authors: Xiaoyan Zhu, Tian Qiu, Jiayin Wang, Xin Lai
Journal: Software: Practice and Experience
Published: 2024-01-24
DOI: 10.1002/spe.3316 (https://doi.org/10.1002/spe.3316)
Cross-project (CP) just-in-time software defect prediction (JIT-SDP) uses CP data to overcome initial data scarcity for training high-performing JIT-SDP classifiers in the early stages of software projects. The primary challenge faced by JIT-SDP in a cross-project context lies in the distinct distributions between training and test data. To tackle this issue, we select source data instances that closely resemble target data for building classifiers. Software datasets commonly exhibit a class imbalance problem, where the ratio of the defective class to the clean class is notably low. This imbalance typically diminishes classifier performance. In this study, we propose an instance selection method utilizing kernel mean matching (ISKMM) that addresses both knowledge transfer and class imbalance in cross-project defect prediction (CPDP). The method employs the kernel mean matching (KMM) technique to assess the similarity between training and target data. It selects instances with high similarity, retains them, and resamples the data based on similarity weighting to mitigate the class imbalance problem. Our experiments, conducted on 10 open-source projects, reveal that the ISKMM algorithm outperforms existing CP single-source software defect prediction (SDP) algorithms. Moreover, when employing the proposed algorithm, defect predictors constructed from cross-project data demonstrate an overall performance comparable to predictors learned from within-project data.
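The abstract describes three steps: weight source instances by their similarity to the target data with KMM, keep the high-similarity instances, and resample by weight to reduce class imbalance. The following is a minimal sketch of that pipeline under stated assumptions, not the authors' ISKMM implementation. The function names, the RBF kernel with `gamma=1.0`, the weight bound `upper`, the `keep_ratio` threshold, and the projected-gradient solver are all illustrative choices, and the constraint that standard KMM places on the sum of the weights is omitted for brevity.

```python
import math
import random


def rbf(a, b, gamma):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def kmm_weights(source, target, gamma=1.0, upper=10.0, steps=500):
    """Approximate KMM weights with projected gradient descent.

    Minimises 0.5 * b'Kb - kappa'b subject to 0 <= b_i <= upper.
    Assumption: the usual constraint on sum(b) is left out.
    """
    n_s, n_t = len(source), len(target)
    K = [[rbf(a, b, gamma) for b in source] for a in source]
    kappa = [n_s / n_t * sum(rbf(a, t, gamma) for t in target) for a in source]
    # trace(K) bounds the largest eigenvalue of K, so this step size is safe.
    lr = 1.0 / sum(K[i][i] for i in range(n_s))
    beta = [1.0] * n_s
    for _ in range(steps):
        grad = [sum(K[i][j] * beta[j] for j in range(n_s)) - kappa[i]
                for i in range(n_s)]
        # Gradient step followed by projection onto the box [0, upper].
        beta = [min(upper, max(0.0, b - lr * g)) for b, g in zip(beta, grad)]
    return beta


def select_and_resample(source, labels, weights, keep_ratio=0.5, seed=0):
    """Keep the highest-weighted instances, then oversample the smaller class."""
    rng = random.Random(seed)
    order = sorted(range(len(source)), key=lambda i: weights[i], reverse=True)
    kept = order[: max(1, int(len(order) * keep_ratio))]
    defective = [i for i in kept if labels[i] == 1]
    clean = [i for i in kept if labels[i] == 0]
    minority, majority = sorted([defective, clean], key=len)
    extra = []
    if minority and len(minority) < len(majority):
        # Draw extra minority instances with probability proportional to weight.
        probs = [weights[i] + 1e-12 for i in minority]
        extra = rng.choices(minority, weights=probs, k=len(majority) - len(minority))
    chosen = kept + extra
    return [source[i] for i in chosen], [labels[i] for i in chosen]


# Toy data (hypothetical): four source changes lie near the target, four far away.
source = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0], [3.0, 3.0],
          [20.0, 20.0], [23.0, 20.0], [20.0, 23.0], [23.0, 23.0]]
labels = [1, 0, 0, 0, 1, 0, 0, 0]  # 1 = defective, 0 = clean
target = [[0.2, 0.1], [2.9, 0.1], [0.1, 3.2], [3.1, 2.8]]

weights = kmm_weights(source, target)
X_bal, y_bal = select_and_resample(source, labels, weights)
print([round(w, 3) for w in weights])
print(y_bal)
```

On this toy data the four source instances close to the target receive much larger weights than the four distant ones, so only the nearby instances are kept, and the single retained defective instance is duplicated until both classes have three members. A practical implementation would more likely solve the full KMM quadratic program with a dedicated QP solver and tune the kernel width and selection threshold on real project data.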