Multi-source cross-domain vulnerability detection based on code pre-trained model

IF 4.3 | Zone 2, Computer Science | JCR Q2, Computer Science, Information Systems

Information and Software Technology | Pub Date: 2025-05-22 | DOI: 10.1016/j.infsof.2025.107764

Yang Cao, Yunwei Dong

Information and Software Technology, Volume 185, Article 107764. Full text: https://www.sciencedirect.com/science/article/pii/S095058492500103X

Citations: 0

Abstract

Context:

In recent years, deep learning-based vulnerability detection methods have achieved significant success. These methods predict vulnerabilities by automatically learning patterns from code annotated with vulnerability information. However, labeled data is usually concentrated in a few software projects and programming languages. In practice, due to distribution discrepancy in vulnerabilities across different software projects or programming languages, vulnerability detection models trained on limited projects or a specific language often struggle to generalize to new projects or languages. Currently, cross-domain vulnerability detection methods utilize domain adaptation to reduce the distribution discrepancy between the labeled source domain and the target domain being tested. However, the language models used in existing methods limit the expressive power of feature vectors, and they only employ single-source domain adaptation methods.

Objective:

To address the limitations of current cross-domain vulnerability detection methods, we propose a new method for Multi-Source cross-domain Vulnerability Detection (MSVD).

Method:

MSVD combines two knowledge transfer methods, fine-tuning and domain adaptation. The fine-tuned code pre-trained model extracts code features, generating more meaningful code vector representations. The adversarial-based multi-source domain adaptation method aligns features between multiple source domains and the target domain, leveraging richer knowledge from multiple source domains.
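The abstract does not give MSVD's loss function or network layout, so the following is only a minimal sketch of a generic adversarial multi-source objective, not the authors' implementation. It assumes one binary domain discriminator per source domain (source versus target), binary cross-entropy for both tasks, and a trade-off weight `lam`; all function and parameter names are hypothetical. The feature extractor is trained to lower the vulnerability classification loss while raising the discriminators' loss, which is what pushes source and target features toward a shared distribution.

```python
import math


def bce(p, y):
    """Binary cross-entropy for one predicted probability p and label y (0 or 1)."""
    eps = 1e-12  # keep log() away from 0
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mean(values):
    return sum(values) / len(values)


def multi_source_adversarial_objective(cls_preds, dom_preds_src, dom_preds_tgt, lam=1.0):
    """Return (extractor_loss, discriminator_loss) for one batch.

    Assumed (hypothetical) inputs, keyed by source-domain name:
      cls_preds[k]     : list of (p_vulnerable, label) on labeled samples of source k
      dom_preds_src[k] : discriminator k's P(sample is from source k) on source-k samples
      dom_preds_tgt[k] : discriminator k's P(sample is from source k) on target samples
    """
    # Supervised loss, averaged over source domains (the target has no labels).
    cls_loss = mean([mean([bce(p, y) for p, y in batch]) for batch in cls_preds.values()])

    # Each discriminator separates its own source (label 1) from the target (label 0).
    dom_losses = []
    for name in dom_preds_src:
        src_terms = [bce(p, 1) for p in dom_preds_src[name]]
        tgt_terms = [bce(p, 0) for p in dom_preds_tgt[name]]
        dom_losses.append(mean(src_terms + tgt_terms))
    dom_loss = mean(dom_losses)

    # Discriminators minimize dom_loss; the extractor minimizes
    # cls_loss - lam * dom_loss, i.e. it tries to confuse the discriminators.
    extractor_loss = cls_loss - lam * dom_loss
    return extractor_loss, dom_loss
```

In a real training loop the probabilities would come from a fine-tuned code model and the sign flip would usually be implemented with a gradient reversal layer or alternating updates; the sketch only shows how the two loss terms combine across several sources.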

Results:

We conducted experiments on real datasets comprising various languages and projects to evaluate the effectiveness of MSVD. Experiment results show that, compared to the baselines in the target domain, MSVD improves F1-score, accuracy, and AUC in the cross-language scenario by 2.95%–112.90%, 4.37%–27.65%, and 4.19%–57.83%, respectively. Additionally, in the cross-project scenario, MSVD achieves the highest F1-score and shows superior performance in terms of accuracy and AUC.
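The upper end of the F1 range (112.90%) can only be a relative gain, since F1 is bounded by 1 and cannot rise by more than 100 percentage points. The abstract does not state the formula, so the sketch below assumes the usual definition, (new − baseline) / baseline, and shows how the three reported metrics are typically computed for a binary vulnerable/non-vulnerable classifier. The function names and the sample values are illustrative assumptions, not data from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (vulnerable) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as half), which equals the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def relative_improvement(new, baseline):
    """Relative gain over a baseline, in percent (assumed definition)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    y_true = [1, 1, 0, 0, 1]
    scores = [0.9, 0.4, 0.3, 0.6, 0.8]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), auc(y_true, scores))
    print(relative_improvement(0.60, 0.40))  # a baseline F1 of 0.40 rising to 0.60
```

Under this definition a baseline F1 of 0.40 that rises to 0.60 counts as a 50% improvement, which is why low baselines can produce gains above 100%.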

Conclusion:

These results indicate that compared to the current state-of-the-art methods, MSVD significantly improves vulnerability detection performance in two cross-domain settings: cross-language and cross-project, when the target domain is unlabeled.

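Source journal

Information and Software Technology (Engineering and Technology: Computer Science, Software Engineering)

CiteScore: 9.10
Self-citation rate: 7.70%
Publication volume: 164
Review time: 9.6 weeks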

Journal description: Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:

• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development

Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "negative" results and much more. Read the Guide for Authors for more information. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.