基于深度学习的漏洞检测技术中C/ c++漏洞数据集的代码中心评估

Proceedings of the 16th Innovations in Software Engineering Conference Pub Date : 2023-02-23 DOI:10.1145/3578527.3578530

Ridhi Jain, Nicole Gervasoni, Mthandazo Ndhlovu, Sanjay Rawat

{"title":"基于深度学习的漏洞检测技术中C/ c++漏洞数据集的代码中心评估","authors":"Ridhi Jain, Nicole Gervasoni, Mthandazo Ndhlovu, Sanjay Rawat","doi":"10.1145/3578527.3578530","DOIUrl":null,"url":null,"abstract":"Recent years have witnessed tremendous progress in NLP-based code comprehension via deep neural networks (DNN) learning, especially Large Language Models (LLMs). While the original application of LLMs is focused on code generation, there have been attempts to extend the application to more specialized tasks, like code similarity, author attribution, code repairs, and so on. As data plays an important role in the success of any machine learning approach, researchers have also proposed several benchmarks which are coupled with a specific task at hand. It is well known in the machine learning (ML) community that the presence of biases in the dataset affects the quality of the ML algorithm in a real-world scenario. This paper evaluates several existing datasets from DNN’s application perspective. We specifically focus on training datasets of C/C++ language code. Our choice of language stems from the fact that while LLM-based techniques have been applied and evaluated on programming languages like Python, JavaScript, and Ruby, there is not much LLM research for C/C++. As a result, datasets generated synthetically or from real-world codes are in individual research work. Consequently, in the absence of a uniform dataset, such works are hard to compare with each other. In this work, we aim to achieve two main objectives– 1. propose code-centric features that are relevant to security program analysis tasks like vulnerability detection; 2. a thorough (qualitative and quantitative) examination of the existing code datasets that demonstrate the main characteristics of the individual datasets to have a clear comparison. Our evaluation finds exciting facts about existing datasets highlighting gaps that need to be addressed.","PeriodicalId":326318,"journal":{"name":"Proceedings of the 16th Innovations in Software Engineering Conference","volume":"92 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"A Code Centric Evaluation of C/C++ Vulnerability Datasets for Deep Learning Based Vulnerability Detection Techniques\",\"authors\":\"Ridhi Jain, Nicole Gervasoni, Mthandazo Ndhlovu, Sanjay Rawat\",\"doi\":\"10.1145/3578527.3578530\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent years have witnessed tremendous progress in NLP-based code comprehension via deep neural networks (DNN) learning, especially Large Language Models (LLMs). While the original application of LLMs is focused on code generation, there have been attempts to extend the application to more specialized tasks, like code similarity, author attribution, code repairs, and so on. As data plays an important role in the success of any machine learning approach, researchers have also proposed several benchmarks which are coupled with a specific task at hand. It is well known in the machine learning (ML) community that the presence of biases in the dataset affects the quality of the ML algorithm in a real-world scenario. This paper evaluates several existing datasets from DNN’s application perspective. We specifically focus on training datasets of C/C++ language code. Our choice of language stems from the fact that while LLM-based techniques have been applied and evaluated on programming languages like Python, JavaScript, and Ruby, there is not much LLM research for C/C++. As a result, datasets generated synthetically or from real-world codes are in individual research work. Consequently, in the absence of a uniform dataset, such works are hard to compare with each other. In this work, we aim to achieve two main objectives– 1. propose code-centric features that are relevant to security program analysis tasks like vulnerability detection; 2. a thorough (qualitative and quantitative) examination of the existing code datasets that demonstrate the main characteristics of the individual datasets to have a clear comparison. Our evaluation finds exciting facts about existing datasets highlighting gaps that need to be addressed.\",\"PeriodicalId\":326318,\"journal\":{\"name\":\"Proceedings of the 16th Innovations in Software Engineering Conference\",\"volume\":\"92 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-02-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 16th Innovations in Software Engineering Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3578527.3578530\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 16th Innovations in Software Engineering Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3578527.3578530","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

摘要

近年来，通过深度神经网络(DNN)学习，特别是大型语言模型(llm)，在基于nlp的代码理解方面取得了巨大进展。虽然llm的原始应用程序主要集中在代码生成上，但已经有人尝试将该应用程序扩展到更专门的任务，如代码相似性、作者归属、代码修复等。由于数据在任何机器学习方法的成功中都起着重要作用，研究人员还提出了几个与手头特定任务相结合的基准。众所周知，在机器学习(ML)社区中，数据集中偏差的存在会影响现实场景中ML算法的质量。本文从深度神经网络应用的角度对现有的几个数据集进行了评估。我们特别关注C/ c++语言代码的训练数据集。我们选择的语言源于这样一个事实，即尽管基于法学硕士的技术已经在Python、JavaScript和Ruby等编程语言上得到了应用和评估，但针对C/ c++的法学硕士研究并不多。因此，合成或从现实世界代码生成的数据集是在个人研究工作中。因此，在缺乏统一数据集的情况下，这些工作很难相互比较。在这项工作中，我们的目标是实现两个主要目标:1。提出与漏洞检测等安全程序分析任务相关的以代码为中心的功能;2. 对现有代码数据集进行彻底的(定性和定量)检查，展示各个数据集的主要特征，以便进行清晰的比较。我们的评估发现了有关现有数据集的令人兴奋的事实，突出了需要解决的差距。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Code Centric Evaluation of C/C++ Vulnerability Datasets for Deep Learning Based Vulnerability Detection Techniques

Recent years have witnessed tremendous progress in NLP-based code comprehension via deep neural networks (DNN) learning, especially Large Language Models (LLMs). While the original application of LLMs is focused on code generation, there have been attempts to extend the application to more specialized tasks, like code similarity, author attribution, code repairs, and so on. As data plays an important role in the success of any machine learning approach, researchers have also proposed several benchmarks which are coupled with a specific task at hand. It is well known in the machine learning (ML) community that the presence of biases in the dataset affects the quality of the ML algorithm in a real-world scenario. This paper evaluates several existing datasets from DNN’s application perspective. We specifically focus on training datasets of C/C++ language code. Our choice of language stems from the fact that while LLM-based techniques have been applied and evaluated on programming languages like Python, JavaScript, and Ruby, there is not much LLM research for C/C++. As a result, datasets generated synthetically or from real-world codes are in individual research work. Consequently, in the absence of a uniform dataset, such works are hard to compare with each other. In this work, we aim to achieve two main objectives– 1. propose code-centric features that are relevant to security program analysis tasks like vulnerability detection; 2. a thorough (qualitative and quantitative) examination of the existing code datasets that demonstrate the main characteristics of the individual datasets to have a clear comparison. Our evaluation finds exciting facts about existing datasets highlighting gaps that need to be addressed.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 16th Innovations in Software Engineering Conference

自引率

0.00%

发文量