Applying Novel Resampling Strategies To Software Defect Prediction

NAFIPS 2007 - 2007 Annual Meeting of the North American Fuzzy Information Processing Society. Pub Date: 2007-06-24. DOI: 10.1109/NAFIPS.2007.383813

Lourdes Pelayo, S. Dick


Citations: 136

Abstract

Due to the tremendous complexity and sophistication of software, improving software reliability is an enormously difficult task. We study the software defect prediction problem, which focuses on predicting which modules will experience a failure during operation. Numerous studies have applied machine learning to software defect prediction; however, skewness in defect-prediction datasets usually undermines the learning algorithms. The resulting classifiers often fail to predict the faulty minority class at all. This problem is well known in machine learning and is often referred to as learning from unbalanced datasets. We examine stratification, a widely used technique for learning from unbalanced data that has received little attention in software defect prediction. Our experiments focus on the SMOTE technique, a method of over-sampling minority-class examples. Our goal is to determine whether SMOTE can improve recognition of defect-prone modules, and at what cost. Our experiments demonstrate that SMOTE resampling yields a more balanced classification. We found an improvement of at least 23% in the average geometric mean classification accuracy on four benchmark datasets.
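The two technical ingredients named in the abstract, SMOTE over-sampling and geometric mean accuracy, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the neighbour count, the amount of over-sampling, or the classifier, so `k`, `n_synthetic`, `seed`, and both function names are assumptions made for illustration. The geometric mean is taken here as the square root of the product of the true positive rate and the true negative rate, a common definition in unbalanced learning that the abstract itself does not spell out.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority-class points (illustrative sketch).

    Each new point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Interpolate a random fraction of the way towards the neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (nb - b) for b, nb in zip(base, neighbour)])
    return synthetic


def geometric_mean_accuracy(y_true, y_pred):
    """sqrt(TPR * TNR) for labels 1 (defect-prone) and 0 (not defect-prone).

    Assumes y_true contains at least one example of each class.
    """
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    tpr = sum(p == 1 for p in pos) / len(pos)
    tnr = sum(p == 0 for p in neg) / len(neg)
    return math.sqrt(tpr * tnr)
```

Under this definition, a classifier that never predicts the faulty class scores a geometric mean of zero regardless of its overall accuracy, which is why the metric suits the skewed datasets the abstract describes.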
