Deep Neural Embedding for Software Vulnerability Discovery: Comparison and Optimization

Secur. Commun. Networks | Pub Date: 2022-01-18 | DOI: 10.1155/2022/5203217

Xue Yuan, Guanjun Lin, Yonghang Tai, Jun Zhang


Citations: 15


Deep Neural Embedding for Software Vulnerability Discovery: Comparison and Optimization

Because sophisticated software programs contain numerous vulnerabilities, the detection performance of existing approaches requires further improvement. Many vulnerability detection approaches have been proposed to aid code inspection; among them, a line of work applies deep learning (DL) techniques and achieves promising results. This paper uses CodeBERT, a deep contextualized model, as an embedding solution to facilitate the detection of vulnerabilities in open-source C projects. Applying CodeBERT to code analysis reveals the rich, latent patterns within software code and has the potential to benefit downstream tasks such as software vulnerability detection. CodeBERT inherits the architecture of BERT: a stack of bidirectional transformer encoders. This facilitates the learning of vulnerable code patterns, which requires long-range dependency analysis. Additionally, the multihead attention mechanism of the transformer can focus on multiple key variables of a data flow at once, which is crucial for analyzing and tracing potentially vulnerable data flows and ultimately improves detection performance. To evaluate the effectiveness of the proposed CodeBERT-based embedding solution, four embedding methods are compared for generating software code embeddings: CodeBERT and three mainstream alternatives, Word2Vec, GloVe, and FastText. Experimental results show that CodeBERT-based embedding outperforms the other embedding models on the downstream vulnerability detection tasks. To further boost performance, we propose to include synthetic vulnerable functions and to fine-tune on both synthetic and real-world data so that the model learns C-related vulnerable code patterns. We also explore suitable configurations of CodeBERT. The evaluation results show that the model with the new parameters outperforms some state-of-the-art detection methods on our dataset.
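
The abstract attributes CodeBERT's advantage to bidirectional encoding and multihead attention. The following is a minimal pure-Python sketch of those two ideas, not CodeBERT itself: the token list and the 4-dimensional vectors are invented for illustration, and the learned query/key/value projections of a real transformer are replaced by plain slices of the input vector (an assumption made to keep the example short).

```python
import math

# Hypothetical toy input: tokens of a C statement and made-up embeddings.
TOKENS = ["memcpy", "(", "buf", ",", "src", ",", "len", ")"]
EMBEDDINGS = [
    [0.9, 0.1, 0.3, 0.0],
    [0.0, 0.1, 0.0, 0.1],
    [0.8, 0.2, 0.1, 0.7],
    [0.0, 0.1, 0.1, 0.0],
    [0.7, 0.3, 0.2, 0.6],
    [0.0, 0.1, 0.1, 0.0],
    [0.2, 0.9, 0.8, 0.1],
    [0.1, 0.0, 0.0, 0.1],
]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(queries, keys, values):
    """Scaled dot-product attention for one head.

    No causal mask is applied, so every position attends to positions
    both before and after it (the bidirectional property).
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * v[i] for w, v in zip(row, values))
             for i in range(len(values[0]))]
        )
    return outputs, weights


def multi_head_self_attention(embeddings, num_heads):
    """Run one attention head per slice of the vector, then concatenate."""
    dim = len(embeddings[0])
    if dim % num_heads != 0:
        raise ValueError("embedding size must be divisible by num_heads")
    head_dim = dim // num_heads
    head_outputs, head_weights = [], []
    for h in range(num_heads):
        # Assumption: identity projections, i.e. each head sees a raw slice.
        part = [vec[h * head_dim:(h + 1) * head_dim] for vec in embeddings]
        out, w = attention_head(part, part, part)
        head_outputs.append(out)
        head_weights.append(w)
    merged = [
        sum((head_outputs[h][t] for h in range(num_heads)), [])
        for t in range(len(embeddings))
    ]
    return merged, head_weights


merged, weights = multi_head_self_attention(EMBEDDINGS, num_heads=2)

# Show how strongly "buf" (index 2) attends to each token in each head.
for h, head in enumerate(weights):
    pairs = ", ".join(f"{tok}:{w:.2f}" for tok, w in zip(TOKENS, head[2]))
    print(f"head {h}: {pairs}")
```

Each head produces its own weight distribution over the whole statement, so different heads can emphasize different tokens of the same call. In a trained model the projections are learned, which is what lets heads specialize on variables that matter for a given data flow.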

