{"title":"Multi-source cross-domain vulnerability detection based on code pre-trained model","authors":"Yang Cao, Yunwei Dong","doi":"10.1016/j.infsof.2025.107764","DOIUrl":null,"url":null,"abstract":"<div><h3>Context:</h3><div>In recent years, deep learning-based vulnerability detection methods have achieved significant success. These methods predict vulnerabilities by automatically learning patterns from code annotated with vulnerability information. However, labeled data is usually concentrated in a few software projects and programming languages. In practice, due to distribution discrepancy in vulnerabilities across different software projects or programming languages, vulnerability detection models trained on limited projects or a specific language often struggle to generalize to new projects or languages. Currently, cross-domain vulnerability detection methods utilize domain adaptation to reduce the distribution discrepancy between the labeled source domain and the target domain being tested. However, the language models used in existing methods limit the expressive power of feature vectors, and they only employ single-source domain adaptation methods.</div></div><div><h3>Objective:</h3><div>To address the limitations of current cross-domain vulnerability detection methods, we propose a new method for <u>M</u>ulti-<u>S</u>ource cross-domain <u>V</u>ulnerability <u>D</u>etection (<em>MSVD</em>).</div></div><div><h3>Method:</h3><div>MSVD combines two knowledge transfer methods, fine-tuning and domain adaptation. The fine-tuned code pre-trained model extracts code features, generating more meaningful code vector representations. The adversarial-based multi-source domain adaptation method aligns features between multiple source domains and the target domain, leveraging richer knowledge from multiple source domains.</div></div><div><h3>Results:</h3><div>We conducted experiments on real datasets comprising various languages and projects to evaluate the effectiveness of MSVD. Experiment results show that, compared to the baselines in the target domain, MSVD improves F1-score, accuracy, and AUC in the cross-language scenario by 2.95%<span><math><mo>∼</mo></math></span>112.90%, 4.37%<span><math><mo>∼</mo></math></span>27.65%, and 4.19%<span><math><mo>∼</mo></math></span>57.83%, respectively. Additionally, in the cross-project scenario, MSVD achieves the highest F1-score and shows superior performance in terms of accuracy and AUC.</div></div><div><h3>Conclusion:</h3><div>These results indicate that compared to the current state-of-the-art methods, MSVD significantly improves vulnerability detection performance in two cross-domain settings: cross-language and cross-project, when the target domain is unlabeled.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"185 ","pages":"Article 107764"},"PeriodicalIF":4.3000,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information and Software Technology","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S095058492500103X","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Context:
In recent years, deep learning-based vulnerability detection methods have achieved significant success. These methods predict vulnerabilities by automatically learning patterns from code annotated with vulnerability information. However, labeled data is usually concentrated in a few software projects and programming languages. In practice, due to distribution discrepancy in vulnerabilities across different software projects or programming languages, vulnerability detection models trained on limited projects or a specific language often struggle to generalize to new projects or languages. Currently, cross-domain vulnerability detection methods utilize domain adaptation to reduce the distribution discrepancy between the labeled source domain and the target domain being tested. However, the language models used in existing methods limit the expressive power of feature vectors, and they only employ single-source domain adaptation methods.
Objective:
To address the limitations of current cross-domain vulnerability detection methods, we propose a new method for Multi-Source cross-domain Vulnerability Detection (MSVD).
Method:
MSVD combines two knowledge transfer methods, fine-tuning and domain adaptation. The fine-tuned code pre-trained model extracts code features, generating more meaningful code vector representations. The adversarial-based multi-source domain adaptation method aligns features between multiple source domains and the target domain, leveraging richer knowledge from multiple source domains.
Results:
We conducted experiments on real datasets comprising various languages and projects to evaluate the effectiveness of MSVD. Experiment results show that, compared to the baselines in the target domain, MSVD improves F1-score, accuracy, and AUC in the cross-language scenario by 2.95%112.90%, 4.37%27.65%, and 4.19%57.83%, respectively. Additionally, in the cross-project scenario, MSVD achieves the highest F1-score and shows superior performance in terms of accuracy and AUC.
Conclusion:
These results indicate that compared to the current state-of-the-art methods, MSVD significantly improves vulnerability detection performance in two cross-domain settings: cross-language and cross-project, when the target domain is unlabeled.
期刊介绍:
Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal''s scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics,
• Software processes,
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "Negative" results and much more. Read the Guide for authors for more information.
The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premiere outlet for systematic literature studies in software engineering.