Learning never stops: Improving software vulnerability type identification via incremental learning
Authors: Jiacheng Xue, Xiang Chen, Zhanqi Cui, Yong Liu
Journal of Systems and Software, volume 230, Article 112544. Published 2025-07-05. DOI: 10.1016/j.jss.2025.112544
https://www.sciencedirect.com/science/article/pii/S0164121225002134
Because new vulnerabilities are continuously discovered, the data for software vulnerability type identification (SVTI) is dynamic. SVTI data also tends to follow a long-tailed distribution: some vulnerability types (the head classes) have numerous samples, while rare ones (the tail classes) have very few. Together these properties cause two problems for SVTI: catastrophic forgetting when the model learns new data, and poor performance on rare vulnerability types. To address these challenges, we propose VulTypeIL. For incremental learning, we combine a hybrid replay strategy with a regularization strategy based on elastic weight consolidation (EWC) to alleviate catastrophic forgetting. To handle the long-tailed distribution, we integrate focal loss and label-smoothing cross-entropy loss. For model construction, we customize the verbalizer and a hybrid prompt that fuses the vulnerability code with its description, and then perform prompt tuning on the pre-trained model CodeT5. To evaluate VulTypeIL, we construct a large-scale SVTI dataset containing 6,269 vulnerabilities from 992 real-world projects. Our experimental results show that VulTypeIL significantly outperforms state-of-the-art baselines such as VulExplainer and LIVABLE. The ablation studies further confirm the effectiveness of the key component settings in our approach, including the incremental learning setting and the long-tailed learning setting.
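The abstract names the loss components but does not say how they are combined. The following is a minimal sketch of the standard textbook forms of focal loss, label-smoothing cross-entropy, and the EWC penalty, written in plain Python for a single sample. The mixing weight `alpha`, the defaults for `gamma`, `eps`, and `lam`, and the function names are all assumptions made for illustration; they are not taken from VulTypeIL itself.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss -(1 - p_t)^gamma * log(p_t); gamma = 0 is plain cross-entropy."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def label_smoothing_ce(logits, target, eps=0.1):
    """Cross-entropy against a smoothed target (one common variant):
    eps / K on every class, plus the remaining 1 - eps on the true class."""
    probs = softmax(logits)
    k = len(logits)
    loss = 0.0
    for i, p in enumerate(probs):
        q = eps / k + (1.0 - eps if i == target else 0.0)
        loss -= q * math.log(p)
    return loss


def ewc_penalty(params, old_params, fisher, lam=1.0):
    """EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2.
    fisher holds the diagonal Fisher information estimated on the old task."""
    return 0.5 * lam * sum(
        f * (p - o) ** 2 for p, o, f in zip(params, old_params, fisher)
    )


def total_loss(logits, target, params, old_params, fisher,
               alpha=0.5, gamma=2.0, eps=0.1, lam=1.0):
    """Hypothetical combination: alpha mixes the two classification losses,
    and the EWC penalty is added on top. The weighting is an assumption."""
    cls = (alpha * focal_loss(logits, target, gamma)
           + (1.0 - alpha) * label_smoothing_ce(logits, target, eps))
    return cls + ewc_penalty(params, old_params, fisher, lam)
```

In this form, focal loss down-weights samples the model already classifies confidently, which shifts the gradient toward hard (often tail-class) samples, while the EWC term penalizes moving parameters that were important for previously learned vulnerability types. The replay strategy is not sketched here because the abstract does not describe how the replayed samples are selected.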
About the journal:
The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
• Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
• Agile, model-driven, service-oriented, open source and global software development
• Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
• Human factors and management concerns of software development
• Data management and big data issues of software systems
• Metrics and evaluation, data mining of software development resources
• Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.