Enhancing long-tailed software vulnerability type classification via adaptive data augmentation and prompt tuning

IF 7.2 | Zone 1: Computer Science | JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Applied Soft Computing | Pub Date: 2025-07-22 | DOI: 10.1016/j.asoc.2025.113612

Long Zhang, Xiaolin Ju, Lina Gong, Jiyu Wang, Zilong Ren

Applied Soft Computing, Volume 182, Article 113612. Journal Article. https://www.sciencedirect.com/science/article/pii/S1568494625009238


Abstract

Software vulnerability type classification (SVTC) is essential for efficient and targeted remediation of vulnerabilities. With the rapid increase in software vulnerabilities, the demand for automated SVTC approaches is becoming increasingly critical. However, SVTC is significantly affected by the long-tailed problem, in which the distribution of vulnerability types is highly imbalanced: a small number of head classes contain a large volume of samples, while a substantial portion of tail classes contain only a few samples each. This imbalance poses a significant challenge to the classification accuracy of existing approaches. To alleviate this challenge, we propose VulTC-LTPF, an approach that integrates prompt tuning with long-tailed learning to enhance the effectiveness of SVTC. Within VulTC-LTPF, we develop an adaptive error-rate-based data augmentation strategy that allows the SVTC model to dynamically augment data for tail-class types with limited samples during training, thereby mitigating the impact of the long-tailed problem. Furthermore, VulTC-LTPF employs a hybrid prompt tuning strategy that aligns the training process more closely with pre-training, which enhances adaptability to downstream tasks. Unlike existing approaches that rely solely on either the vulnerability description or the source code, VulTC-LTPF leverages both sources of information. By combining hard and soft prompts, it supports a more comprehensive and effective classification strategy. Experimental results demonstrate that VulTC-LTPF achieves substantial performance improvements over four state-of-the-art SVTC baselines, with gains ranging from 26.1% to 55.1% in Matthews correlation coefficient (MCC). Ablation studies further validate the effectiveness of the adaptive data augmentation, prompt tuning, the integration of the two types of vulnerability information, and the use of hybrid prompts. These findings indicate that VulTC-LTPF is a promising advance in SVTC, with potential for further progress on software vulnerability type classification challenges.
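The abstract does not spell out how the error rate drives augmentation, so the following is only a minimal sketch of one plausible reading: after an evaluation pass, compute the per-class error rate and divide a fixed augmentation budget among the tail classes in proportion to it. The function names, the `tail_threshold` and `total_budget` parameters, and the proportional allocation rule are all assumptions made for illustration; they are not taken from VulTC-LTPF.

```python
from collections import Counter


def per_class_error_rates(y_true, y_pred):
    """Fraction of misclassified samples for each true class."""
    totals = Counter(y_true)
    errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {c: errors[c] / totals[c] for c in totals}


def augmentation_budget(y_true, y_pred, tail_threshold, total_budget):
    """Split total_budget new samples across tail classes by error rate.

    Hypothetical rule: a class is a tail class if it has at most
    tail_threshold samples; its share of the budget is proportional
    to its current error rate.
    """
    totals = Counter(y_true)
    rates = per_class_error_rates(y_true, y_pred)
    tail = {c: rates[c] for c in totals if totals[c] <= tail_threshold}
    rate_sum = sum(tail.values())
    if rate_sum == 0:
        # No tail class is misclassified, so nothing needs augmenting.
        return {c: 0 for c in tail}
    return {c: round(total_budget * r / rate_sum) for c, r in tail.items()}


# Illustrative labels only: one head class and two tail classes.
y_true = ["CWE-79"] * 6 + ["CWE-22"] * 2 + ["CWE-94"] * 2
y_pred = ["CWE-79"] * 6 + ["CWE-79", "CWE-22"] + ["CWE-79", "CWE-79"]
budget = augmentation_budget(y_true, y_pred, tail_threshold=2, total_budget=30)
```

Calling this again after each epoch would make the allocation adaptive: classes the model still gets wrong receive more synthetic samples, while classes it has learned receive fewer.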


Source journal

Applied Soft Computing (Engineering & Technology: Computer Science, Interdisciplinary Applications)

CiteScore: 15.80

Self-citation rate: 6.90%

Articles published: 874

Review time: 10.9 months

About the journal: Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. Its focus is to publish the highest-quality research on the application and convergence of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets, and other similar techniques to address real-world complexities. Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. Therefore, the website is continuously updated with new articles and the publication time is short.