A hybrid‐ensemble model for software defect prediction for balanced and imbalanced datasets using AI‐based techniques with feature preservation: SMERKP‐XGB

IF 1.7 · Region 4 (Computer Science) · Q3 COMPUTER SCIENCE, SOFTWARE ENGINEERING
Mohd Mustaqeem, Tamanna Siddiqui, Suhel Mustajab
DOI: 10.1002/smr.2731 (https://doi.org/10.1002/smr.2731)
Journal: Journal of Software-Evolution and Process
Published: 2024-09-18
Publication type: Journal Article
Citations: 0

Abstract

Maintaining software quality is a significant challenge as software complexity increases with the growth of the software industry. Software defects are a primary concern in complex modules, and predicting them in the early stages of the software development life cycle (SDLC) is difficult. Previous techniques to address this issue have not been very promising. To overcome this problem, we propose “A hybrid ensemble model for software defect prediction using AI‐based techniques with feature preservation.” We use the National Aeronautics and Space Administration (NASA) dataset from the PROMISE repository for testing and validation. By applying exploratory data analysis (EDA), feature engineering, scaling, and standardization, we found that the dataset is imbalanced, which can negatively affect the model's performance. To address this, we use the Synthetic Minority Oversampling Technique (SMOTE) combined with edited nearest neighbor (ENN) cleaning (SMOTE‐ENN). We also use recursive feature elimination with cross‐validation (RFE‐CV) inside a pipeline to prevent data leakage during CV, and kernel‐based principal component analysis (K‐PCA) to reduce dimensionality and select relevant features. The reduced‐dimensional data are then given to eXtreme Gradient Boosting (XGBoost) for classification, resulting in the hybrid‐ensemble (SMERKP‐XGB) model. The proposed SMERKP‐XGB model outperforms previously developed models in terms of accuracy (CM1: 97.53%, PC1: 92.05%, PC2: 97.45%, and KC1: 95.65%), area under the receiver operating characteristic curve (CM1: 96.30%, PC1: 98.30%, PC2: 99.30%, and KC1: 93.54%), and other evaluation criteria mentioned in the literature.
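The SMOTE‐ENN balancing step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`k_nearest`, `smote`, `enn`, `smote_enn`), the toy two-metric data, the choice of `k=3`, and the majority-vote ENN rule are all assumptions made for illustration, and only the Python standard library is used. The later stages (RFE‐CV, K‐PCA, XGBoost) are not shown; in a leakage-free setup, resampling like this would be applied to the training folds only.

```python
import math
import random

def k_nearest(point, pool, k):
    """Indices of the (up to) k points in `pool` closest to `point` (Euclidean)."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(point, pool[i]))
    return order[:k]

def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbour = others[rng.choice(k_nearest(base, others, k))]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

def enn(X, y, k=3):
    """Edited nearest neighbours (majority-vote variant, an assumption here):
    drop every sample whose label loses the vote of its k nearest neighbours."""
    keep_X, keep_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        others_X = X[:i] + X[i + 1:]
        others_y = y[:i] + y[i + 1:]
        votes = [others_y[j] for j in k_nearest(xi, others_X, k)]
        if votes.count(yi) * 2 > len(votes):
            keep_X.append(xi)
            keep_y.append(yi)
    return keep_X, keep_y

def smote_enn(X, y, minority_label=1, k=3, seed=0):
    """Oversample the minority class to parity with SMOTE, then clean with ENN."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = max((len(X) - len(minority)) - len(minority), 0)
    new_points = smote(minority, n_new, k, random.Random(seed))
    return enn(X + new_points, y + [minority_label] * len(new_points), k)

# Hypothetical toy data: two code metrics per module, label 1 = defective.
X = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8), (1.3, 1.1),
     (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]
X_res, y_res = smote_enn(X, y)
print(len(X_res), y_res.count(0), y_res.count(1))
```

Because each synthetic point lies on a segment between two existing minority samples, oversampling stays inside the region the minority class already occupies; the ENN pass then removes samples of either class that sit among neighbours of the other class.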
Source journal
Journal of Software-Evolution and Process (COMPUTER SCIENCE, SOFTWARE ENGINEERING)
Self-citation rate: 10.00%
Articles published: 109