Learning never stops: Improving software vulnerability type identification via incremental learning

IF 3.7 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING
Jiacheng Xue , Xiang Chen , Zhanqi Cui , Yong Liu
DOI: 10.1016/j.jss.2025.112544
Journal: Journal of Systems and Software, Volume 230, Article 112544
Published: 2025-07-05 (Journal Article; not open access)
Source: https://www.sciencedirect.com/science/article/pii/S0164121225002134
Citations: 0

Abstract

Learning never stops: Improving software vulnerability type identification via incremental learning
As new vulnerabilities are continuously discovered, software vulnerability type identification (SVTI) data is dynamic. Moreover, SVTI data often exhibits a long-tailed distribution, where some vulnerability types (i.e., head classes) have numerous samples, while rare ones (i.e., tail classes) have very few. These issues present challenges for SVTI, such as catastrophic forgetting when learning new data and poor performance for rare vulnerability types. To address these challenges, we propose an approach, VulTypeIL. Specifically, for incremental learning, we employ a hybrid replay strategy and a regularization strategy with EWC to alleviate the catastrophic forgetting issue. We also integrate focal loss and label-smoothing cross-entropy loss to tackle the long-tailed distribution issue. For model construction, we customize the verbalizer and hybrid prompt by fusing the vulnerability code and description. Then, we perform prompt tuning on the pre-trained model CodeT5. To evaluate the effectiveness of VulTypeIL, we construct a large-scale SVTI dataset containing 6,269 vulnerabilities from 992 real-world projects. Our experimental results demonstrate that VulTypeIL outperforms state-of-the-art baselines (such as VulExplainer and LIVABLE) with a significant improvement. The ablation studies further confirm the effectiveness of key component settings (such as the incremental learning setting and long-tailed learning setting) in our approach.
Source journal: Journal of Systems and Software (Engineering & Technology - Computer Science: Theory & Methods)
CiteScore: 8.60
Self-citation rate: 5.70%
Articles published: 193
Review time: 16 weeks
Journal description: The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
• Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
• Agile, model-driven, service-oriented, open source and global software development
• Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
• Human factors and management concerns of software development
• Data management and big data issues of software systems
• Metrics and evaluation, data mining of software development resources
• Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.