Title: Semi-supervised software vulnerability assessment via code lexical and structural information fusion
Authors: Wenlong Pei, Yilin Huang, Xiang Chen, Guilong Lu, Yong Liu, Chao Ni
Journal: Automated Software Engineering, vol. 32, no. 2 (Q3, Computer Science, Software Engineering; Impact Factor 3.1)
Publication date: 2025-06-03 (Journal Article)
DOI: 10.1007/s10515-025-00526-4
URL: https://link.springer.com/article/10.1007/s10515-025-00526-4
Citations: 0
Abstract
In recent years, data-driven approaches have become popular for software vulnerability assessment (SVA). However, these approaches require a large amount of labeled SVA data to construct effective SVA models. Accurate labeling demands security expertise, incurring significant costs and introducing potential errors. Therefore, collecting training datasets for SVA can be a challenging task. To alleviate the SVA data labeling cost, we propose SURF, an approach that makes full use of a limited amount of labeled SVA data combined with a large amount of unlabeled SVA data to train the SVA model via semi-supervised learning. Furthermore, SURF incorporates lexical information (i.e., treating the code as plain text) and structural information (i.e., treating the code as a code property graph) as bimodal inputs for SVA model training, which further improves its performance. Through extensive experiments, we evaluated the effectiveness of SURF on a dataset of vulnerable C/C++ functions from real-world software projects. The results show that, by labeling only 30% of the SVA data, SURF can reach or even exceed the performance of state-of-the-art SVA baselines (such as DeepCVA and Func), even when these supervised baselines use 100% of the labeled SVA data. Furthermore, SURF also exceeds the performance of PILOT, the state-of-the-art positive-unlabeled learning baseline, when both are trained on 30% of the labeled SVA data.
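The abstract describes two ideas that can be sketched generically: early fusion of lexical and structural feature vectors, and semi-supervised self-training that pseudo-labels confident unlabeled examples. The sketch below is a minimal illustration of that general pattern only; the feature extractors, the nearest-centroid classifier, and the `self_train` loop are hypothetical stand-ins, not SURF's actual model or code-property-graph encoder.

```python
# Illustrative sketch (NOT the paper's method): early fusion of two
# modality vectors plus confidence-based self-training on toy features.
import math

def fuse(lexical, structural):
    """Concatenate the two modality vectors (simple early fusion)."""
    return lexical + structural

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class NearestCentroid:
    """Tiny stand-in classifier: predicts the class of the nearest centroid."""
    def fit(self, X, y):
        self.centroids = {c: centroid([x for x, lab in zip(X, y) if lab == c])
                          for c in set(y)}
        return self

    def predict_with_margin(self, x):
        # Margin = gap between the two closest centroids; a crude
        # confidence proxy for deciding whether to trust a pseudo-label.
        d = sorted((dist(x, c), label) for label, c in self.centroids.items())
        margin = d[1][0] - d[0][0] if len(d) > 1 else float("inf")
        return d[0][1], margin

def self_train(labeled, unlabeled, rounds=3, tau=0.5):
    """Self-training loop: fit on labeled data, adopt confident
    pseudo-labels from the unlabeled pool, and refit."""
    X = [fuse(lx, st) for lx, st, _ in labeled]
    y = [lab for _, _, lab in labeled]
    pool = [fuse(lx, st) for lx, st in unlabeled]
    for _ in range(rounds):
        model = NearestCentroid().fit(X, y)
        kept = []
        for x in pool:
            lab, margin = model.predict_with_margin(x)
            if margin >= tau:          # confident -> adopt pseudo-label
                X.append(x)
                y.append(lab)
            else:                      # uncertain -> leave in the pool
                kept.append(x)
        pool = kept
    return NearestCentroid().fit(X, y)
```

With two labeled examples `([0.0], [0.0], "low")` and `([5.0], [5.0], "high")` plus nearby unlabeled points, the loop absorbs the unlabeled points into the matching classes and the final model classifies new fused vectors accordingly. Real SVA systems would replace the toy vectors with learned code embeddings and the centroid rule with a neural classifier.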
About the journal
This journal publishes research papers, tutorial papers, surveys, and accounts of significant industrial experience in the foundations, techniques, tools, and applications of automated software engineering technology. This includes the study of techniques for constructing, understanding, adapting, and modeling software artifacts and processes.
Coverage in Automated Software Engineering examines both automatic systems and collaborative systems as well as computational models of human software engineering activities. In addition, it presents knowledge representations and artificial intelligence techniques applicable to automated software engineering, and formal techniques that support or provide theoretical foundations. The journal also includes reviews of books, software, conferences and workshops.