Cross-version defect prediction via hybrid active learning with kernel principal component analysis

Zhou Xu, Jin Liu, Xiapu Luo, Zhang Tao
Published in: 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 209-220
Publication date: 2018-03-20
DOI: https://doi.org/10.1109/SANER.2018.8330210
Citations: 52

Abstract

Because defects in software modules can cause product failure and financial loss, it is critical to use defect prediction methods to identify potentially defective modules for thorough inspection, especially in the early stages of the software development lifecycle. For an upcoming version of a software project, it is practical to use the labeled historical defect data of prior versions of the same project to predict defects in the current version, i.e., Cross-Version Defect Prediction (CVDP). However, software development is a dynamic, evolving process that may cause the data distribution (such as defect characteristics) to vary across versions. Furthermore, the raw features often fail to reveal the intrinsic structure of the data. Both issues make effective CVDP challenging. In this paper, we propose a two-stage CVDP framework that combines Hybrid Active Learning and Kernel PCA (HALKP) to address these two issues. In the first stage, HALKP uses a hybrid active learning method to select informative and representative unlabeled modules from the current version, queries their labels, and merges them into the labeled modules of the prior version to form an enhanced training set. In the second stage, HALKP employs kernel PCA, a non-linear mapping method, to extract representative features by embedding the original data of the two versions into a high-dimensional space. We evaluate the HALKP framework on 31 versions of 10 projects with three widely used performance indicators. The experimental results indicate that HALKP achieves encouraging results, with average F-measure, g-mean, and Balance of 0.480, 0.592, and 0.580, respectively, and significantly outperforms nearly all baseline methods.
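The abstract names the three performance indicators but does not define them. The sketch below shows how they are typically computed from binary predictions, assuming the definitions commonly used in the defect prediction literature: F-measure as the harmonic mean of precision and recall, g-mean as the geometric mean of recall (pd) and 1 minus the false alarm rate (pf), and Balance as one minus the normalized Euclidean distance from the ideal point (pf = 0, pd = 1). The function names and the sample labels are hypothetical and are not taken from the paper; the paper's exact formulas may differ.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN, treating label 1 as 'defective' (assumed encoding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def defect_metrics(y_true, y_pred):
    """Compute F-measure, g-mean and Balance under the assumed definitions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    pd = safe_div(tp, tp + fn)         # recall, probability of detection
    pf = safe_div(fp, fp + tn)         # false alarm rate
    precision = safe_div(tp, tp + fp)
    f_measure = safe_div(2 * precision * pd, precision + pd)
    g_mean = math.sqrt(pd * (1 - pf))
    # Distance from the ideal point (pf=0, pd=1), scaled into [0, 1].
    balance = 1 - math.sqrt(pf ** 2 + (1 - pd) ** 2) / math.sqrt(2)
    return {"f_measure": f_measure, "g_mean": g_mean, "balance": balance}


if __name__ == "__main__":
    # Hypothetical labels for ten modules of a "current version".
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    for name, value in defect_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.3f}")
```

All three indicators lie in [0, 1], with higher values being better. Unlike plain accuracy, g-mean and Balance account for the false alarm rate, which matters when defective modules are the minority class.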