ConCPDP: A Cross-Project Defect Prediction Method Integrating Contrastive Pretraining and Category Boundary Adjustment

IF 1.3 | Zone 4, Computer Science | Q3 COMPUTER SCIENCE, SOFTWARE ENGINEERING

IET Software Pub Date : 2024-11-13 DOI:10.1049/2024/5102699

Hengjie Song, Yufei Pan, Feng Guo, Xue Zhang, Le Ma, Siyu Jiang


Citations: 0

Abstract

Software defect prediction (SDP) is a crucial phase preceding the launch of software products. Cross-project defect prediction (CPDP) was introduced to predict defects in new projects that lack defect labels. CPDP uses defect information from mature projects to speed up defect prediction for new projects, so that developers can quickly obtain defect information for the new project and target their testing accordingly. At present, the predominant CPDP approaches rely on deep learning, and the performance of the final model is notably affected by the quality of the training dataset. However, CPDP datasets contain few samples, and new projects have almost no label information, so general deep-learning-based CPDP models perform poorly. In addition, most current CPDP models do not fully consider the enrichment of samples near the classification boundary after cross-domain transfer, which leads to suboptimal predictive capability. To overcome these obstacles, we present contrastive learning pretraining for CPDP (ConCPDP), a CPDP method integrating contrastive pretraining and category boundary adjustment. We first perform data augmentation on the source and target domain code files and then parse the augmented files into abstract syntax trees (ASTs). Each AST is then transformed into an integer sequence using specific mapping rules, and this sequence serves as input to the subsequent neural network. A neural network based on bidirectional long short-term memory (Bi-LSTM) receives the integer sequence and outputs a feature vector. The feature vectors are then fed into the contrastive module to optimise the feature extraction network. The pretrained feature extractor can be fine-tuned using the maximum mean discrepancy (MMD) between the feature distributions of the source and target domains, together with the binary classification loss on the source domain. We conduct extensive experiments on the PROMISE dataset, which is commonly used for CPDP, to validate ConCPDP’s efficacy; it achieves superior results in terms of F₁ measure, area under the curve (AUC), and Matthews correlation coefficient (MCC).
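
Two of the steps described in the abstract, mapping an AST to an integer sequence and measuring the MMD between source and target features, can be illustrated with a minimal sketch. It is not the authors' implementation. Python's built-in ast module stands in for a parser of the projects' own source language, the vocabulary scheme (index 0 shared by padding and unknown tokens) is an assumption, and the MMD uses a linear kernel because the abstract does not say which kernel is used. All function names are hypothetical. The MCC helper applies the standard confusion-matrix formula. The Bi-LSTM encoder and the contrastive module are not sketched here.

```python
import ast
import math


def ast_to_tokens(source):
    """Parse source code and return AST node type names.

    ast.walk visits nodes breadth-first, starting at the root Module node.
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def build_vocab(token_lists):
    """Assign each distinct token a positive integer id (assumed scheme).

    Id 0 is reserved for padding and for tokens not seen when building
    the vocabulary.
    """
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def encode(tokens, vocab, max_len):
    """Map tokens to ids, truncate to max_len, and right-pad with zeros."""
    ids = [vocab.get(tok, 0) for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


def linear_mmd2(xs, ys):
    """Squared MMD with a linear kernel (biased estimate).

    For k(x, y) = x . y this reduces to the squared Euclidean distance
    between the mean feature vectors of the two samples.
    """
    dim = len(xs[0])
    mean_x = [sum(v[i] for v in xs) / len(xs) for i in range(dim)]
    mean_y = [sum(v[i] for v in ys) / len(ys) for i in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_x, mean_y))


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    tokens = ast_to_tokens("x = 1")
    vocab = build_vocab([tokens])
    print(encode(tokens, vocab, max_len=8))
    # Toy 2-D "features" for a source and a target project.
    print(linear_mmd2([[0.0, 0.0], [2.0, 2.0]], [[1.0, 1.0]]))
    print(mcc(tp=5, fp=0, tn=5, fn=0))
```

In a fine-tuning setup like the one the abstract outlines, a term such as linear_mmd2 would be computed on the encoder's outputs for a source batch and a target batch and added to the source-domain classification loss. How the two terms are weighted is not stated in the abstract.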




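Source journal

IET Software (Engineering and Technology, Computer Science: Software Engineering)

CiteScore

4.20

Self-citation rate

0.00%

Publication volume

Review time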

9 months

Journal description: IET Software publishes papers on all aspects of the software lifecycle, including design, development, implementation and maintenance. The focus of the journal is on the methods used to develop and maintain software, and their practical application. Authors are especially encouraged to submit papers on the following topics, although papers on all aspects of software engineering are welcome: software and systems requirements engineering; formal methods, design methods, practice and experience; software architecture, aspect and object orientation, reuse and re-engineering; testing, verification and validation techniques; software dependability and measurement; human systems engineering and human-computer interaction; knowledge engineering, expert and knowledge-based systems, intelligent agents; information systems engineering; application of software engineering in industry and commerce; software engineering technology transfer; management of software development; theoretical aspects of software development; machine learning; big data and big code; cloud computing.

Current Special Issues, calls for papers:
Knowledge Discovery for Software Development - https://digital-library.theiet.org/files/IET_SEN_CFP_KDSD.pdf
Big Data Analytics for Sustainable Software Development - https://digital-library.theiet.org/files/IET_SEN_CFP_BDASSD.pdf