Instance Selection and Class Balancing Techniques for Cross Project Defect Prediction

2018 7th Brazilian Conference on Intelligent Systems (BRACIS) Pub Date : 2018-10-01 DOI:10.1109/BRACIS.2018.00101

Alysson Bispo, R. Prudêncio, D. V. D. Silva


Citations: 3

Abstract

Various software metrics and statistical models have been developed to help companies predict software defects. Traditional software defect prediction approaches use historical data about previous bugs in a project to build predictive machine learning models. However, in many cases the historical testing data available in a project is scarce, i.e., very few or even no labeled training instances are available, which results in a low-quality defect prediction model. To overcome this limitation, Cross-Project Defect Prediction (CPDP) can be adopted to learn a defect prediction model for a project of interest (i.e., a target project) by reusing (transferring) data collected from several previous projects (i.e., source projects). In this paper, we focus on neighborhood-based instance selection techniques for CPDP, which select labeled instances in the source projects that are similar to the unlabeled instances available in the target project. Despite their simplicity, these techniques have limitations, which are addressed in our work. First, although they can select representative source instances, the quality of the selected instances is usually not addressed. Additionally, bug prediction datasets are normally imbalanced (i.e., there are more non-defective instances than defective ones), which can harm learning performance. We propose a new transfer learning approach for CPDP, in which instances selected by a neighborhood-based technique are filtered by the Fuzzy-Rough Instance Selection (FRIS) technique in order to remove noisy instances from the training set. Next, to address the class imbalance problem, the Synthetic Minority Oversampling Technique (SMOTE) is adopted to oversample the minority (defect-prone) class, thus increasing the chance of correctly identifying defects. Experiments were performed on a benchmark set of Java projects, achieving promising results.
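The selection and balancing steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`select_neighbors`, `smote_like`), the Euclidean distance, the values of `k`, and the toy data are all assumptions made for illustration. The FRIS noise filter is omitted, since the abstract gives no detail on how it is configured; in the proposed pipeline it would run between the two functions below.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_neighbors(source, target, k=3):
    """Keep the k labeled source rows nearest to each unlabeled target row.

    source: list of (features, label) pairs; target: list of feature tuples.
    Returns the union of all selected source rows, in original order.
    """
    chosen = set()
    for t in target:
        ranked = sorted(range(len(source)),
                        key=lambda i: euclidean(source[i][0], t))
        chosen.update(ranked[:k])
    return [source[i] for i in sorted(chosen)]


def smote_like(rows, minority_label=1, k=2, seed=0):
    """SMOTE-style oversampling: add synthetic minority rows by interpolating
    between a minority row and one of its k nearest minority neighbors,
    until both classes have the same number of rows."""
    rng = random.Random(seed)
    minority = [x for x, y in rows if y == minority_label]
    majority_count = len(rows) - len(minority)
    if len(minority) < 2:
        return list(rows)  # interpolation needs at least two minority rows
    synthetic = []
    while len(minority) + len(synthetic) < majority_count:
        i = rng.randrange(len(minority))
        base = minority[i]
        others = sorted((j for j in range(len(minority)) if j != i),
                        key=lambda j: euclidean(base, minority[j]))[:k]
        other = minority[rng.choice(others)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return list(rows) + [(s, minority_label) for s in synthetic]


# Hypothetical toy data: two metrics per module, label 1 = defective.
source = [((1.0, 1.0), 0), ((1.2, 0.9), 0), ((0.9, 1.1), 0), ((1.1, 1.2), 0),
          ((3.0, 3.0), 1), ((3.2, 2.8), 1), ((9.0, 9.0), 0), ((10.0, 10.0), 1)]
target = [(1.0, 1.1), (3.1, 3.0), (1.1, 1.0)]

selected = select_neighbors(source, target, k=3)  # drops rows far from target
balanced = smote_like(selected)                   # equalizes class counts
```

In this toy setup, the two source rows that lie far from every target row are never selected, and the oversampling step adds synthetic defective rows only on the segment between existing defective rows, so no new region of the feature space is invented.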

