Nima Shiri Harzevili, Alvine Boaye Belle, Junjie Wang, Song Wang, Zhen Ming (Jack) Jiang, Nachiappan Nagappan
DOI: 10.1145/3699711
URL: https://doi.org/10.1145/3699711
Journal: ACM Computing Surveys (JCR Q1, Computer Science, Theory & Methods)
Impact factor: 23.8
Publication date: 2024-10-11
Publication type: Journal Article
Open access: No
Source platform: Semantic Scholar
Citation count: 0
A Systematic Literature Review on Automated Software Vulnerability Detection Using Machine Learning
In recent years, numerous Machine Learning (ML) models, including Deep Learning (DL) and classic ML models, have been developed to detect software vulnerabilities. However, there is a notable lack of comprehensive and systematic surveys that summarize, classify, and analyze the applications of these ML models in software vulnerability detection. This absence may lead to critical research areas being overlooked or under-represented, resulting in a skewed understanding of the current state of the art in software vulnerability detection. To close this gap, we present a comprehensive and systematic literature review that characterizes the different properties of ML-based software vulnerability detection systems using six major research questions (RQs). Using a custom web scraper, our systematic approach extracts a set of studies from four widely used online digital libraries: ACM Digital Library, IEEE Xplore, ScienceDirect, and Google Scholar. We manually analyzed the extracted studies to filter out work unrelated to software vulnerability detection, then created taxonomies and addressed the research questions. Our analysis indicates a significant upward trend in applying ML techniques for software vulnerability detection over the past few years. Prominent conference venues include the International Conference on Software Engineering (ICSE), the International Symposium on Software Reliability Engineering (ISSRE), the Mining Software Repositories (MSR) conference, and the ACM International Conference on the Foundations of Software Engineering (FSE), while Information and Software Technology (IST), Computers & Security (C&S), and the Journal of Systems and Software (JSS) are the leading journal venues. Our results reveal that 39.1% of the subject studies use hybrid sources while 37.6% of the subject studies utilize benchmark data for software vulnerability detection.
Code-based data are the most commonly used data type among subject studies, with source code being the predominant subtype. Graph-based and token-based input representations are the most popular techniques, accounting for 57.2% and 24.6% of the subject studies, respectively. Among the input embedding techniques, graph embedding and token vector embedding are the most frequently used, accounting for 32.6% and 29.7% of the subject studies, respectively. Additionally, 88.4% of the subject studies use DL models, with Recurrent Neural Networks (RNNs) and Graph Neural Networks (GNNs) being the most popular subcategories, while only 7.2% use classic ML models. Among the vulnerability types covered by the subject studies, CWE-119, CWE-20, and CWE-190 are the most frequent ones. Among the tools used for software vulnerability detection, Keras with a TensorFlow backend and PyTorch are the most frequently used model-building libraries, each accounting for 42 studies. Joern is the most popular code representation tool, used in 24 studies. Finally, we summarize the challenges and future directions in the context of software vulnerability detection, providing valuable insights for researchers and practitioners in the field.
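
For readers unfamiliar with the token-based input representation mentioned in the abstract, the following is a minimal sketch of the general idea, not the method of the survey or of any study it covers. It assumes a simplified regular expression for C-like tokens and a toy vocabulary; the names `TOKEN_RE`, `tokenize`, `build_vocab`, and `encode` are hypothetical and chosen only for illustration. The resulting integer sequence is the kind of input that a token vector embedding layer would typically consume.

```python
import re
from collections import Counter

# Assumed, simplified token pattern for C-like code (illustrative only):
# identifiers/keywords, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split a source snippet into a flat list of lexical tokens."""
    return TOKEN_RE.findall(code)


def build_vocab(token_lists, min_count=1):
    """Map each sufficiently frequent token to an integer id.

    Ids 0 and 1 are reserved for padding and unknown tokens.
    """
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, n in sorted(counts.items()):
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to a fixed-length id sequence (truncate, then pad)."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


# Toy snippet with an unchecked copy into a fixed-size buffer.
snippet = "char buf[8]; strcpy(buf, src);"
tokens = tokenize(snippet)
vocab = build_vocab([tokens])
ids = encode(tokens, vocab, max_len=16)
print(tokens)
print(ids)
```

A fixed `max_len` with padding is used here because sequence models are usually trained on equal-length batches; real pipelines would use a proper lexer and a vocabulary built from the whole training corpus.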
About the journal:
ACM Computing Surveys is an academic journal that focuses on publishing surveys and tutorials on various areas of computing research and practice. The journal aims to provide comprehensive and easily understandable articles that guide readers through the literature and help them understand topics outside their specialties. In terms of impact, CSUR has a high reputation with a 2022 Impact Factor of 16.6. It is ranked 3rd out of 111 journals in the field of Computer Science Theory & Methods.
ACM Computing Surveys is indexed and abstracted in various services, including AI2 Semantic Scholar, Baidu, Clarivate/ISI: JCR, CNKI, DeepDyve, DTU, EBSCO: EDS/HOST, and IET Inspec, among others.