Online Defect Prediction for Imbalanced Data

2015 IEEE/ACM 37th IEEE International Conference on Software Engineering. Pub Date: 2015-05-16. DOI: 10.1109/ICSE.2015.139

Ming Tan, Lin Tan, Sashank Dara, Caleb Mayeux


Citations: 256

Abstract

Many defect prediction techniques have been proposed to improve software reliability. Change classification predicts defects at the change level, where a change is the set of modifications to one file in a commit. In this paper, we conduct the first study of applying change classification in practice. We identify two issues in the prediction process, both of which contribute to low prediction performance. First, the data are imbalanced: there are far fewer buggy changes than clean changes. Second, the commonly used cross-validation approach is inappropriate for evaluating the performance of change classification. To address these challenges, we apply and adapt online change classification, resampling, and updatable classification techniques to improve classification performance. We evaluate the improved change classification techniques on one proprietary and six open source projects. Our results show that these techniques improve the precision of change classification by 12.2% to 89.5%, or 6.4 to 34.8 percentage points (pp), on the seven projects. In addition, we integrate change classification into the development process of the proprietary project. We have learned the following lessons: 1) new solutions are needed to convince developers to use and believe prediction results, and prediction results need to be actionable; 2) new and improved classification algorithms are needed to explain the prediction results, and insensible and unactionable explanations need to be filtered or refined; and 3) new techniques are needed to improve the relatively low precision.
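
The two issues named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which resampling technique is used, so random undersampling of the majority class is assumed here, and a simple timestamp cutoff stands in for online (time-ordered) evaluation, where the model is trained only on changes committed before the ones it is tested on. All names (Change, time_ordered_split, undersample_majority, precision) and the synthetic data are hypothetical.

```python
import random
from typing import List, Tuple

# Each change is (commit_timestamp, features, is_buggy).
# The type and all function names below are hypothetical, for illustration only.
Change = Tuple[int, List[float], bool]


def time_ordered_split(changes: List[Change], cutoff: int) -> Tuple[List[Change], List[Change]]:
    """Train only on changes committed before `cutoff`; test on the rest."""
    ordered = sorted(changes, key=lambda c: c[0])
    train = [c for c in ordered if c[0] < cutoff]
    test = [c for c in ordered if c[0] >= cutoff]
    return train, test


def undersample_majority(train: List[Change], seed: int = 0) -> List[Change]:
    """Randomly drop majority-class changes until both classes have equal size.

    Assumption: random undersampling is only one possible resampling choice.
    """
    buggy = [c for c in train if c[2]]
    clean = [c for c in train if not c[2]]
    minority, majority = (buggy, clean) if len(buggy) <= len(clean) else (clean, buggy)
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))
    # Restore chronological order after sampling.
    return sorted(minority + kept, key=lambda c: c[0])


def precision(predicted: List[bool], actual: List[bool]) -> float:
    """Fraction of changes predicted buggy that really are buggy."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    pred_pos = sum(1 for p in predicted if p)
    return true_pos / pred_pos if pred_pos else 0.0


# Synthetic, imbalanced data: one change in five is buggy (made up for the demo).
changes: List[Change] = [(t, [float(t)], t % 5 == 0) for t in range(100)]
train, test = time_ordered_split(changes, cutoff=80)
balanced = undersample_majority(train)
```

Only the training set is resampled; the test set keeps its natural class ratio so that the measured precision reflects what developers would see in practice.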

