Lingual-Fusion Adapter-Based Transfer Learning for Low-Resource Code Vulnerability Detection

IF 4.3 · Zone 2 (Computer Science) · JCR Q1 (ENGINEERING, ELECTRICAL & ELECTRONIC)

IEEE Transactions on Consumer Electronics · Pub Date: 2025-01-28 · DOI: 10.1109/TCE.2025.3535638

Xinyue Long; Shikai Guo; Yu Chai; Hui Li; Sumaira Ameer Jan; Qian Ma; Qiao Ning

Journal Article · Vol. 71, No. 1, pp. 1008-1023 · https://ieeexplore.ieee.org/document/10856218/

Citations: 0

Abstract



Software vulnerabilities pose significant security threats to modern systems, particularly those involving complex execution sequences and intricate call relationships across multiple execution points. For instance, in a scenario where a software system integrates legacy code in a low-resource programming language like PHP, detecting vulnerabilities becomes challenging due to data scarcity and the complexity of temporal relationships among code fragments. This scarcity hampers the ability to capture critical temporal features essential for identifying vulnerabilities spanning multiple execution points. Consequently, existing approaches face major limitations, including neglecting temporal information in code fragments and lacking sufficient data to enable effective generalization for models in low-resource languages. To address these challenges, we introduce TaVer, a novel approach that enhances vulnerability detection in low-resource languages by extracting complex temporal features from code fragments and employing parameter-efficient transfer learning to leverage shared knowledge from resource-rich languages. TaVer comprises two key components: 1) Code Vulnerability Detection Component: This component models temporal dependencies by leveraging execution paths extracted from Abstract Syntax Trees (ASTs), capturing both short-term variations and long-term dependencies among code fragments. This enables comprehensive extraction of complex temporal features, significantly enhancing the accuracy of vulnerability detection. 2) Cross-Lingual Transfer Component: This component learns generalizable features from resource-rich languages and efficiently transfers them to low-resource languages. By updating a small number of downstream parameters, it enhances model generalization and achieves precise vulnerability detection.
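The abstract does not say how TaVer extracts execution paths from ASTs, so the following is only an illustrative sketch of the general idea of turning a syntax tree into path sequences. It assumes root-to-leaf paths of node type names as a stand-in for the paper's execution paths, and it uses Python's standard `ast` module on a Python snippet purely for convenience. The function name `root_to_leaf_paths` and the sample snippet are hypothetical and not taken from the paper.

```python
import ast


def root_to_leaf_paths(source):
    """Return every root-to-leaf path of node type names in the AST of `source`.

    Assumption: root-to-leaf paths stand in for the "execution paths"
    mentioned in the abstract; the paper's actual extraction may differ.
    """
    tree = ast.parse(source)
    paths = []

    def walk(node, prefix):
        current = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:
            # A node without child nodes ends one path.
            paths.append(current)
            return
        for child in children:
            walk(child, current)

    walk(tree, [])
    return paths


# Hypothetical code fragment used only to demonstrate the traversal.
snippet = "def f(x):\n    if x:\n        return x\n    return 0\n"

for path in root_to_leaf_paths(snippet):
    print(" -> ".join(path))
```

Each printed line is one ordered sequence of node types. Sequences like these are the kind of input a sequential model could consume to learn short-range and long-range dependencies, which is the role the abstract assigns to the Code Vulnerability Detection Component.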
We evaluated TaVer using a diverse set of programming languages from publicly available GitHub repositories, employing C as the resource-rich source language and Java, Python, and PHP as relatively low-resource target languages. Experimental results demonstrate that TaVer outperforms four state-of-the-art approaches across multiple low-resource languages. Specifically, TaVer achieves average improvements of 14.63% in Accuracy, 30.59% in Precision, 37.32% in Recall, and 33.65% in F1-score over the best baseline approaches.
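For readers less familiar with the four reported metrics, the sketch below computes them from binary confusion-matrix counts using the standard definitions. The counts in the example are made up for illustration and are not results from the paper. The abstract also does not state whether the reported improvements are absolute or relative differences, so the code makes no assumption about that.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute Accuracy, Precision, Recall and F1 from binary confusion counts.

    tp/fp/tn/fn are true/false positives and negatives, where "positive"
    means a code fragment flagged as vulnerable.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, chosen only to exercise the formulas.
example = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Precision and Recall can move in opposite directions on imbalanced vulnerability data, which is why F1, their harmonic mean, is usually reported alongside Accuracy.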


Source journal

IEEE Transactions on Consumer Electronics (Engineering & Technology: Telecommunications)

CiteScore: 7.70

Self-citation rate: 9.30%

Articles published: (no value listed)

Review time: 3.3 months

Journal description: The main focus of the IEEE Transactions on Consumer Electronics is the engineering and research aspects of the theory, design, construction, manufacture, or end use of mass-market electronics, systems, software, and services for consumers.