A Two-Stage Data Preprocessing Approach for Software Fault Prediction
Jiaqiang Chen, Shulong Liu, Wangshu Liu, Xiang Chen, Qing Gu, Daoxu Chen
2014 Eighth International Conference on Software Security and Reliability, 2014-06-30
DOI: 10.1109/SERE.2014.15 (https://doi.org/10.1109/SERE.2014.15)
Software fault prediction estimates the fault proneness of software modules so that limited test resources can be allocated effectively for software quality assurance. Researchers have shown that either feature selection or instance reduction can improve the performance of the classification models used for fault prediction. However, to the best of our knowledge, few researchers have combined the two and studied their joint effect on classification models. We therefore propose a novel two-stage data preprocessing approach that incorporates both feature selection and instance reduction. In the feature selection stage, we propose a new algorithm that uses both feature selection and threshold-based clustering and that covers both relevance analysis and redundancy control. In the instance reduction stage, we apply random sampling to balance the faulty and non-faulty classes. In our empirical studies, we implemented five data preprocessing schemes based on the proposed approach and compared the prediction performance of commonly used classification models. The results demonstrate the effectiveness of our approach and provide a guideline for cost-effective data preprocessing when using it.
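The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm, whose details the abstract does not give. The sketch rests on several assumptions: relevance is scored as the absolute Pearson correlation between a feature and the fault label, redundancy control is a greedy rule that drops any feature whose correlation with an already kept feature reaches a threshold (a simple stand-in for threshold-based clustering), and random sampling is taken to mean undersampling the majority class. All function names, the 0.9 threshold, and the 0/1 label encoding are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_features(rows, labels, threshold=0.9):
    """Stage 1 (assumed form): rank by relevance, then control redundancy."""
    n_features = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_features)]
    # Relevance analysis: assumed here to be |correlation with the label|.
    relevance = [abs(pearson(col, labels)) for col in columns]
    ranked = sorted(range(n_features), key=lambda j: relevance[j], reverse=True)
    kept = []
    for j in ranked:
        # Redundancy control: keep a feature only if it is not strongly
        # correlated with any feature that has already been kept.
        if all(abs(pearson(columns[j], columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept


def balance_by_undersampling(rows, labels, seed=0):
    """Stage 2 (assumed form): randomly undersample the majority class."""
    rng = random.Random(seed)
    faulty = [i for i, y in enumerate(labels) if y == 1]
    clean = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((faulty, clean), key=len)
    sampled = rng.sample(majority, len(minority))
    keep = sorted(minority + sampled)
    return [rows[i] for i in keep], [labels[i] for i in keep]


def preprocess(rows, labels, threshold=0.9, seed=0):
    """Run feature selection first, then instance reduction."""
    kept = select_features(rows, labels, threshold)
    reduced = [[row[j] for j in kept] for row in rows]
    new_rows, new_labels = balance_by_undersampling(reduced, labels, seed)
    return kept, new_rows, new_labels


# Toy data: the second feature duplicates the first (scaled by 2),
# and only two of the six modules are faulty (label 1).
rows = [[1, 2, 5], [2, 4, 1], [3, 6, 4], [4, 8, 2], [5, 10, 3], [6, 12, 6]]
labels = [0, 0, 0, 0, 1, 1]
kept, X, y = preprocess(rows, labels)
```

On this toy data, one of the two duplicated features is dropped and the non-faulty class is reduced to the size of the faulty class. In practice the threshold and the sampling strategy would need tuning per dataset, which is the kind of trade-off the five schemes in the study compare.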