Unveiling the impact of unchanged modules across versions on the evaluation of within-project defect prediction models

IF 1.7 · CAS Zone 4 (Computer Science) · Q3 COMPUTER SCIENCE, SOFTWARE ENGINEERING
Xutong Liu, Yufei Zhou, Zeyu Lu, Yuanqing Mei, Yibiao Yang, Junyan Qian, Yuming Zhou
DOI: 10.1002/smr.2715
Journal of Software-Evolution and Process, 36(12), published 2024-08-02
https://onlinelibrary.wiley.com/doi/10.1002/smr.2715
Citations: 0

Abstract

Background

Software defect prediction (SDP) is a topic actively researched in the software engineering community. Within-project defect prediction (WPDP) involves using labeled modules from previous versions of the same project to train classifiers. Over time, many defect prediction models have been evaluated under the WPDP scenario.

Problem

Data duplication poses a significant challenge in current WPDP evaluation procedures. Unchanged modules, characterized by identical executable source code, are frequently present in both target and source versions during experimentation. However, it is still unclear how and to what extent the presence of unchanged modules affects the performance assessment of WPDP models and the comparison of multiple WPDP models.

Method

In this paper, we provide a method to detect and remove unchanged modules from defect datasets and unveil the impact of data duplication in WPDP on model evaluation.
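The detection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes modules can be compared by hashing their normalized executable source code (comments and whitespace stripped), and the function names `normalize_source`, `content_hash`, and `remove_unchanged` are hypothetical.

```python
import hashlib
import re

def normalize_source(code: str) -> str:
    """Strip comments and collapse whitespace so only executable source
    code participates in the comparison (an assumed normalization)."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", "", code)                    # line comments
    return re.sub(r"\s+", " ", code).strip()

def content_hash(code: str) -> str:
    """Fingerprint a module by the hash of its normalized code."""
    return hashlib.sha256(normalize_source(code).encode()).hexdigest()

def remove_unchanged(source_version: dict, target_version: dict) -> dict:
    """Drop target modules whose normalized code also appears in the
    source version. Both arguments map module name -> source code."""
    source_hashes = {content_hash(c) for c in source_version.values()}
    return {name: code for name, code in target_version.items()
            if content_hash(code) not in source_hashes}
```

Under this sketch, a module that differs only in comments or formatting between the source and target versions hashes identically and is removed from the target before evaluation.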

Results

The experiments conducted on 481 target versions from 62 projects provide evidence that data duplication significantly affects the reported performance values of individual learners in WPDP. However, when ranking multiple WPDP models based on prediction performance, the impact of removing unchanged instances is not substantial. Nevertheless, it is important to note that removing unchanged instances does have a slight influence on the selection of models with better generalization.

Conclusion

We recommend that future WPDP studies take into consideration the removal of unchanged modules from target versions when evaluating the performance of their models. This practice will enhance the reliability and validity of the results obtained in WPDP research, leading to improved understanding and advancements in defect prediction models.

Source journal: Journal of Software-Evolution and Process (Computer Science, Software Engineering)
Self-citation rate: 10.00% · Articles per year: 109