S2LMMD: Cross-Project Software Defect Prediction via Statement Semantic Learning and Maximum Mean Discrepancy

2021 28th Asia-Pacific Software Engineering Conference (APSEC) | Pub Date: 2021-12-01 | DOI: 10.1109/APSEC53868.2021.00044

Wangshu Liu, Yongteng Zhu, Xiang Chen, Qing Gu, Xingya Wang, Shenkai Gu


Citations: 2

Abstract



S2LMMD: Cross-Project Software Defect Prediction via Statement Semantic Learning and Maximum Mean Discrepancy

Unlike within-project software defect prediction (WPDP), cross-project software defect prediction (CPDP) does not require sufficient training data from the target project and can therefore help developers in the early stages of software development. Recent studies have tried to learn semantic features for CPDP by feeding neural networks with abstract syntax tree (AST) token vectors. However, ASTs parsed directly from software modules usually have complex structures, with many nodes and great depth, and transfer learning is not commonly adopted to further reduce the data distribution difference between the source project and the target project. To address these problems, we jointly learn from statement-level trees (SLTs) and reduce the data distribution difference with maximum mean discrepancy (MMD) to improve defect prediction performance in CPDP. Specifically, we propose S2LMMD, a novel cross-project defect prediction method based on statement semantic learning and MMD. We first construct the SLTs by splitting the original AST at specified nodes. We then generate more effective semantic features by learning sequence embeddings with a Bi-GRU neural network. Finally, we apply an MMD-based transfer loss to retain more of the characteristics shared across project datasets and further improve CPDP performance. To verify the effectiveness of the proposed method, we conducted experiments on ten widely used open-source projects and evaluated performance using the AUC measure. Our empirical results show that S2LMMD can significantly outperform eight state-of-the-art baselines. In addition, for semantic learning, SLT has a considerable influence on CPDP, while MMD contributes significantly to transfer learning.
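To make the role of MMD in the abstract more concrete, the following is a minimal sketch of the standard biased empirical estimate of squared MMD between a set of source-project feature vectors and a set of target-project feature vectors. It is an illustration only, not the authors' implementation: the abstract does not state which kernel is used, so the Gaussian kernel, the bandwidth parameter `sigma`, and the function names `gaussian_kernel` and `mmd_squared` are all assumptions. In a method like the one described, such a quantity would presumably be computed on learned features and added to the training loss; here plain NumPy arrays stand in for those features.

```python
import numpy as np


def gaussian_kernel(a: np.ndarray, b: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Pairwise Gaussian (RBF) kernel values between the rows of a and b."""
    # Squared Euclidean distance between every row of a and every row of b.
    sq_dist = (
        np.sum(a * a, axis=1)[:, None]
        + np.sum(b * b, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(source: np.ndarray, target: np.ndarray, sigma: float = 1.0) -> float:
    """Biased empirical estimate of squared MMD between two feature sets.

    Assumption: a single Gaussian kernel with bandwidth sigma; the source
    abstract does not specify the kernel.
    """
    k_ss = gaussian_kernel(source, source, sigma).mean()  # source vs. source
    k_tt = gaussian_kernel(target, target, sigma).mean()  # target vs. target
    k_st = gaussian_kernel(source, target, sigma).mean()  # source vs. target
    return float(k_ss + k_tt - 2.0 * k_st)


# Synthetic stand-ins for source- and target-project features (hypothetical data).
rng = np.random.default_rng(0)
source_feats = rng.normal(0.0, 1.0, size=(200, 8))
similar_feats = rng.normal(0.0, 1.0, size=(200, 8))  # same distribution
shifted_feats = rng.normal(2.0, 1.0, size=(200, 8))  # shifted distribution

mmd_similar = mmd_squared(source_feats, similar_feats)
mmd_shifted = mmd_squared(source_feats, shifted_feats)
print(mmd_similar < mmd_shifted)
```

The estimate is smaller when the two samples come from the same distribution than when one of them is shifted, which is the property a transfer loss relies on: minimizing it pushes source and target features toward a shared distribution.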
