基于搜索的跨项目缺陷预测训练数据选择

Proceedings of the The 12th International Conference on Predictive Models and Data Analytics in Software Engineering Pub Date : 2016-09-09 DOI:10.1145/2972958.2972964

Seyedrebvar Hosseini, Burak Turhan, M. Mäntylä

{"title":"基于搜索的跨项目缺陷预测训练数据选择","authors":"Seyedrebvar Hosseini, Burak Turhan, M. Mäntylä","doi":"10.1145/2972958.2972964","DOIUrl":null,"url":null,"abstract":"Context: Previous studies have shown that steered training data or dataset selection can lead to better performance for cross project defect prediction (CPDP). On the other hand, data quality is an issue to consider in CPDP. Aim: We aim at utilising the Nearest Neighbor (NN)-Filter, embedded in a genetic algorithm, for generating evolving training datasets to tackle CPDP, while accounting for potential noise in defect labels. Method: We propose a new search based training data (i.e., instance) selection approach for CPDP called GIS (Genetic Instance Selection) that looks for solutions to optimize a combined measure of F-Measure and GMean, on a validation set generated by (NN)-filter. The genetic operations consider the similarities in features and address possible noise in assigned defect labels. We use 13 datasets from PROMISE repository in order to compare the performance of GIS with benchmark CPDP methods, namely (NN)-filter and naive CPDP, as well as with within project defect prediction (WPDP). Results: Our results show that GIS is significantly better than (NN)-Filter in terms of F-Measure (p -- value ≪ 0.001, Cohen's d = 0.697) and GMean (p -- value ≪ 0.001, Cohen's d = 0.946). It also outperforms the naive CPDP approach in terms of F-Measure (p -- value ≪ 0.001, Cohen's d = 0.753) and GMean (p -- value ≪ 0.001, Cohen's d = 0.994). In addition, the performance of our approach is better than that of WPDP, again considering F-Measure (p -- value ≪ 0.001, Cohen's d = 0.227) and GMean (p -- value ≪ 0.001, Cohen's d = 0.595) values. Conclusions: We conclude that search based instance selection is a promising way to tackle CPDP. Especially, the performance comparison with the within project scenario encourages further investigation of our approach. However, the performance of GIS is based on high recall in the expense of low precision. Using different optimization goals, e.g. targeting high precision, would be a future direction to investigate.","PeriodicalId":176848,"journal":{"name":"Proceedings of the The 12th International Conference on Predictive Models and Data Analytics in Software Engineering","volume":"80 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"34","resultStr":"{\"title\":\"Search Based Training Data Selection For Cross Project Defect Prediction\",\"authors\":\"Seyedrebvar Hosseini, Burak Turhan, M. Mäntylä\",\"doi\":\"10.1145/2972958.2972964\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Context: Previous studies have shown that steered training data or dataset selection can lead to better performance for cross project defect prediction (CPDP). On the other hand, data quality is an issue to consider in CPDP. Aim: We aim at utilising the Nearest Neighbor (NN)-Filter, embedded in a genetic algorithm, for generating evolving training datasets to tackle CPDP, while accounting for potential noise in defect labels. Method: We propose a new search based training data (i.e., instance) selection approach for CPDP called GIS (Genetic Instance Selection) that looks for solutions to optimize a combined measure of F-Measure and GMean, on a validation set generated by (NN)-filter. The genetic operations consider the similarities in features and address possible noise in assigned defect labels. We use 13 datasets from PROMISE repository in order to compare the performance of GIS with benchmark CPDP methods, namely (NN)-filter and naive CPDP, as well as with within project defect prediction (WPDP). Results: Our results show that GIS is significantly better than (NN)-Filter in terms of F-Measure (p -- value ≪ 0.001, Cohen's d = 0.697) and GMean (p -- value ≪ 0.001, Cohen's d = 0.946). It also outperforms the naive CPDP approach in terms of F-Measure (p -- value ≪ 0.001, Cohen's d = 0.753) and GMean (p -- value ≪ 0.001, Cohen's d = 0.994). In addition, the performance of our approach is better than that of WPDP, again considering F-Measure (p -- value ≪ 0.001, Cohen's d = 0.227) and GMean (p -- value ≪ 0.001, Cohen's d = 0.595) values. Conclusions: We conclude that search based instance selection is a promising way to tackle CPDP. Especially, the performance comparison with the within project scenario encourages further investigation of our approach. However, the performance of GIS is based on high recall in the expense of low precision. Using different optimization goals, e.g. targeting high precision, would be a future direction to investigate.\",\"PeriodicalId\":176848,\"journal\":{\"name\":\"Proceedings of the The 12th International Conference on Predictive Models and Data Analytics in Software Engineering\",\"volume\":\"80 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-09-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"34\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the The 12th International Conference on Predictive Models and Data Analytics in Software Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2972958.2972964\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the The 12th International Conference on Predictive Models and Data Analytics in Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2972958.2972964","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 34

摘要

背景:先前的研究表明，定向训练数据或数据集选择可以为跨项目缺陷预测(CPDP)带来更好的性能。另一方面，数据质量是CPDP中需要考虑的一个问题。目的:我们的目标是利用嵌入在遗传算法中的最近邻(NN)过滤器来生成不断发展的训练数据集来解决CPDP，同时考虑缺陷标签中的潜在噪声。方法:我们提出了一种新的基于搜索的CPDP训练数据(即实例)选择方法，称为GIS(遗传实例选择)，该方法在(NN)-filter生成的验证集上寻找优化F-Measure和GMean组合度量的解决方案。遗传操作考虑了特征的相似性，并解决了指定缺陷标签中可能存在的噪声。我们使用PROMISE存储库中的13个数据集，以便将GIS的性能与基准CPDP方法(即(NN)-filter和naive CPDP)以及项目内缺陷预测(WPDP)进行比较。结果:我们的研究结果表明，GIS在F-Measure (p值≪0.001,Cohen's d = 0.697)和GMean (p值≪0.001,Cohen's d = 0.946)方面明显优于(NN)-Filter。它在F-Measure (p值≪0.001,Cohen’s d = 0.753)和GMean (p值≪0.001,Cohen’s d = 0.994)方面也优于朴素的CPDP方法。此外，考虑到F-Measure (p值≪0.001,Cohen's d = 0.227)和GMean (p值≪0.001,Cohen's d = 0.595)值，我们的方法的性能优于WPDP。结论:我们得出结论，基于搜索的实例选择是解决CPDP的一种有希望的方法。特别是，与项目内部场景的性能比较鼓励对我们的方法进行进一步的研究。然而，GIS的性能是以低精度为代价的高查全率为基础的。使用不同的优化目标，例如瞄准高精度，将是未来的研究方向。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Search Based Training Data Selection For Cross Project Defect Prediction

Context: Previous studies have shown that steered training data or dataset selection can lead to better performance for cross project defect prediction (CPDP). On the other hand, data quality is an issue to consider in CPDP. Aim: We aim at utilising the Nearest Neighbor (NN)-Filter, embedded in a genetic algorithm, for generating evolving training datasets to tackle CPDP, while accounting for potential noise in defect labels. Method: We propose a new search based training data (i.e., instance) selection approach for CPDP called GIS (Genetic Instance Selection) that looks for solutions to optimize a combined measure of F-Measure and GMean, on a validation set generated by (NN)-filter. The genetic operations consider the similarities in features and address possible noise in assigned defect labels. We use 13 datasets from PROMISE repository in order to compare the performance of GIS with benchmark CPDP methods, namely (NN)-filter and naive CPDP, as well as with within project defect prediction (WPDP). Results: Our results show that GIS is significantly better than (NN)-Filter in terms of F-Measure (p -- value ≪ 0.001, Cohen's d = 0.697) and GMean (p -- value ≪ 0.001, Cohen's d = 0.946). It also outperforms the naive CPDP approach in terms of F-Measure (p -- value ≪ 0.001, Cohen's d = 0.753) and GMean (p -- value ≪ 0.001, Cohen's d = 0.994). In addition, the performance of our approach is better than that of WPDP, again considering F-Measure (p -- value ≪ 0.001, Cohen's d = 0.227) and GMean (p -- value ≪ 0.001, Cohen's d = 0.595) values. Conclusions: We conclude that search based instance selection is a promising way to tackle CPDP. Especially, the performance comparison with the within project scenario encourages further investigation of our approach. However, the performance of GIS is based on high recall in the expense of low precision. Using different optimization goals, e.g. targeting high precision, would be a future direction to investigate.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the The 12th International Conference on Predictive Models and Data Analytics in Software Engineering

自引率

0.00%

发文量