CFIWSE: A Hybrid Preprocessing Approach for Defect Prediction on Imbalance Real-World Datasets
Jiaxi Xu, Jingwei Shang, Zhichang Huang
2022 IEEE 22nd International Conference on Software Quality, Reliability, and Security Companion (QRS-C), published 2022-12-01
DOI: https://doi.org/10.1109/QRS-C57518.2022.00064
Abstract
Software defect prediction (SDP) uses machine learning models trained on historical defect data to predict new defects. The distribution of software defects is highly imbalanced, which hinders the construction of defect prediction models. In addition, previous studies were usually validated on public datasets built from code metrics rather than on real-world data. In this work, SNA metrics and code metrics are computed on 9 representative real-world projects. A hybrid preprocessing approach named CFIWSE is proposed to improve SDP performance through feature selection, minority sample synthesis, and noise reduction. It consists of two components, CFS and IWSE. CFS uses correlation analysis and nearest neighbor theory for feature selection. IWSE uses information weights and the edited nearest neighbor rule to alleviate the overfitting and noise introduced by minority sample synthesis. The proposed method is verified by experiments on real-world data, and the contribution of each component and the sensitivity to parameters are explored.
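The abstract names three preprocessing stages but does not give the details of CFS or IWSE, so the following is only a minimal Python sketch of a generic pipeline of that shape, not the authors' method. It substitutes plain Pearson-correlation ranking for feature selection, SMOTE-style interpolation for minority sample synthesis, and Wilson's edited nearest neighbor rule for noise reduction. The nearest-neighbor part of CFS and the information weights of IWSE are not modeled. All function names, parameters, and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def select_features(X, y, top_k):
    """Keep the top_k feature indices ranked by |correlation with the label|."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:top_k])


def nearest(X, i, k, candidates):
    """Indices of the k candidates closest (Euclidean) to X[i], excluding i itself."""
    others = [j for j in candidates if j != i]
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]


def oversample_minority(X, y, n_new, k, rng):
    """SMOTE-style synthesis: interpolate between a defective sample (label 1)
    and one of its k nearest defective neighbors."""
    minority = [i for i, label in enumerate(y) if label == 1]
    X_out, y_out = [list(row) for row in X], list(y)
    for _ in range(n_new):
        i = rng.choice(minority)
        j = rng.choice(nearest(X, i, k, minority))
        gap = rng.random()
        X_out.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_out.append(1)
    return X_out, y_out


def edited_nearest_neighbor(X, y, k):
    """Wilson's ENN: drop samples whose label disagrees with the majority
    label among their k nearest neighbors (use odd k for binary labels)."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in nearest(X, i, k, range(len(X))))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (made up): 6 clean modules, 3 defective; feature 1 is uninformative.
X = [[1.0, 5.0, 0.1], [1.2, 3.0, 0.2], [0.9, 4.0, 0.1], [1.1, 6.0, 0.3],
     [1.0, 5.5, 0.2], [0.8, 4.5, 0.1], [3.0, 5.0, 0.9], [3.2, 4.0, 1.0],
     [2.9, 5.2, 0.8]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

cols = select_features(X, y, top_k=2)
X_sel = [[row[j] for j in cols] for row in X]
X_aug, y_aug = oversample_minority(X_sel, y, n_new=3, k=2, rng=random.Random(0))
X_clean, y_clean = edited_nearest_neighbor(X_aug, y_aug, k=3)
print(cols, len(X_aug), len(X_clean))
```

The order used here (select features, synthesize, then filter) follows the order in which the abstract lists the three goals. Running the filter after synthesis means that synthetic samples landing in majority-class regions can be removed along with original noisy samples.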