Improve cross-project just-in-time defect prediction with dynamic transfer learning

IF 3.7 2区计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Journal of Systems and Software Pub Date : 2024-09-20 DOI:10.1016/j.jss.2024.112214

Hongming Dai , Jianqing Xi , Hong-Liang Dai

{"title":"Improve cross-project just-in-time defect prediction with dynamic transfer learning","authors":"Hongming Dai , Jianqing Xi , Hong-Liang Dai","doi":"10.1016/j.jss.2024.112214","DOIUrl":null,"url":null,"abstract":"<div><div>Cross-project just-in-time software defect prediction (CP-JIT-SDP) is a prominent research topic in the field of software engineering. This approach is characterized by its immediacy, accuracy, real-time feedback, and traceability, enabling it to effectively address the challenges of defect prediction in new projects or projects with limited training data. However, CP-JIT-SDP faces significant challenges due to the differences in the feature distribution between the source and target projects. To address this issue, researchers have proposed methods for adjusting marginal or conditional probability distributions. This study introduces a transfer-learning approach that integrates dynamic distribution adaptation. The kernel variance matching (KVM) method is proposed to adjust the disparity in the marginal probability distribution by recalculating the variance of the source and target projects within the reproducing kernel Hilbert space (RKHS) to minimize the variance disparity. The categorical boosting (CatBoost) algorithm is used to construct models, while the improved CORrelation ALignment (CORAL) method is applied to develop the loss function to address the difference in the conditional probability distribution. This method is abbreviated as KCC, where the symbol K represents KVM, the symbol C represents CatBoost, and the next symbol C represents improved CORAL. The KCC method aims to optimize the joint probability distribution of the source project so that it closely agrees with that of the target project through iterative and dynamic integration. Six well-known open-source projects were used to evaluate the effectiveness of the proposed method. The empirical findings indicate that the KCC method exhibited significant improvements over the baseline methods. In particular, the KCC method demonstrated an average increase of 18% in the geometric mean (G-mean), 105.4% in the Matthews correlation coefficient (MCC), 25.6% in the F1-score, and 16.9% in the area under the receiver operating characteristic curve (AUC) when compared to the baseline methods. Furthermore, the KCC method demonstrated greater stability.</div></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":"219 ","pages":"Article 112214"},"PeriodicalIF":3.7000,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systems and Software","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0164121224002589","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

Abstract

Cross-project just-in-time software defect prediction (CP-JIT-SDP) is a prominent research topic in the field of software engineering. This approach is characterized by its immediacy, accuracy, real-time feedback, and traceability, enabling it to effectively address the challenges of defect prediction in new projects or projects with limited training data. However, CP-JIT-SDP faces significant challenges due to the differences in the feature distribution between the source and target projects. To address this issue, researchers have proposed methods for adjusting marginal or conditional probability distributions. This study introduces a transfer-learning approach that integrates dynamic distribution adaptation. The kernel variance matching (KVM) method is proposed to adjust the disparity in the marginal probability distribution by recalculating the variance of the source and target projects within the reproducing kernel Hilbert space (RKHS) to minimize the variance disparity. The categorical boosting (CatBoost) algorithm is used to construct models, while the improved CORrelation ALignment (CORAL) method is applied to develop the loss function to address the difference in the conditional probability distribution. This method is abbreviated as KCC, where the symbol K represents KVM, the symbol C represents CatBoost, and the next symbol C represents improved CORAL. The KCC method aims to optimize the joint probability distribution of the source project so that it closely agrees with that of the target project through iterative and dynamic integration. Six well-known open-source projects were used to evaluate the effectiveness of the proposed method. The empirical findings indicate that the KCC method exhibited significant improvements over the baseline methods. In particular, the KCC method demonstrated an average increase of 18% in the geometric mean (G-mean), 105.4% in the Matthews correlation coefficient (MCC), 25.6% in the F1-score, and 16.9% in the area under the receiver operating characteristic curve (AUC) when compared to the baseline methods. Furthermore, the KCC method demonstrated greater stability.

查看原文本刊更多论文

利用动态迁移学习改进跨项目及时缺陷预测

跨项目及时软件缺陷预测（CP-JIT-SDP）是软件工程领域的一个突出研究课题。这种方法具有即时性、准确性、实时反馈和可追溯性等特点，能有效解决新项目或训练数据有限的项目中的缺陷预测难题。然而，由于源项目和目标项目的特征分布存在差异，CP-JIT-SDP 面临着巨大的挑战。为解决这一问题，研究人员提出了调整边际或条件概率分布的方法。本研究引入了一种整合了动态分布适应的迁移学习方法。研究提出了核方差匹配（KVM）方法，通过在再现核希尔伯特空间（RKHS）内重新计算源项目和目标项目的方差来调整边际概率分布的差异，从而使方差差异最小化。分类提升（CatBoost）算法用于构建模型，而改进的 CORrelation ALignment（CORAL）方法则用于开发损失函数，以解决条件概率分布的差异问题。这种方法简称为 KCC，其中符号 K 代表 KVM，符号 C 代表 CatBoost，下一个符号 C 代表改进 CORAL。KCC 方法旨在通过迭代和动态整合，优化源项目的联合概率分布，使其与目标项目的联合概率分布密切吻合。我们使用了六个知名的开源项目来评估所提出方法的有效性。实证研究结果表明，与基线方法相比，KCC 方法有显著改进。特别是，与基线方法相比，KCC 方法的几何平均数（G-mean）平均提高了 18%，马修斯相关系数（MCC）平均提高了 105.4%，F1 分数平均提高了 25.6%，接收器工作特征曲线下面积（AUC）平均提高了 16.9%。此外，KCC 方法的稳定性更高。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Journal of Systems and Software 工程技术-计算机：理论方法

CiteScore

8.60

自引率

5.70%

发文量

193

审稿时长

16 weeks

期刊介绍： The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to: •Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution •Agile, model-driven, service-oriented, open source and global software development •Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems •Human factors and management concerns of software development •Data management and big data issues of software systems •Metrics and evaluation, data mining of software development resources •Business and economic aspects of software development processes The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.