利用 ML 集合和 C-BERT 分析 D2A 数据集中的源代码漏洞

IF 3.5 2区计算机科学 Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

Empirical Software Engineering Pub Date : 2024-02-22 DOI:10.1007/s10664-023-10405-9

Saurabh Pujar, Yunhui Zheng, Luca Buratti, Burn Lewis, Yunchung Chen, Jim Laredo, Alessandro Morari, Edward Epstein, Tsungnan Lin, Bo Yang, Zhong Su

{"title":"利用 ML 集合和 C-BERT 分析 D2A 数据集中的源代码漏洞","authors":"Saurabh Pujar, Yunhui Zheng, Luca Buratti, Burn Lewis, Yunchung Chen, Jim Laredo, Alessandro Morari, Edward Epstein, Tsungnan Lin, Bo Yang, Zhong Su","doi":"10.1007/s10664-023-10405-9","DOIUrl":null,"url":null,"abstract":"<p>Static analysis tools are widely used for vulnerability detection as they can analyze programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to learn from programming language data opens new possibilities of reducing false positives when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, limited size, and synthetic and unrealistic source code. We propose Differential Dataset Analysis or D2A, a differential analysis based approach to label issues reported by static analysis tools. The dataset built with this approach is called the D2A dataset. The D2A dataset is built by analyzing version pairs from multiple open source projects. From each project, we select bug fixing commits and we run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that got fixed by the commit. We use D2A to generate a large labeled dataset. We then train both classic machine learning models and deep learning models for vulnerability identification using the D2A dataset. We show that the dataset can be used to build a classifier to identify possible false alarms among the issues reported by static analysis, hence helping developers prioritize and investigate potential true positives first. To facilitate future research and contribute to the community, we make the dataset generation pipeline and the dataset publicly available. We have also created a leaderboard based on the D2A dataset, which has already attracted attention and participation from the community.</p>","PeriodicalId":11525,"journal":{"name":"Empirical Software Engineering","volume":"4 1","pages":""},"PeriodicalIF":3.5000,"publicationDate":"2024-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Analyzing source code vulnerabilities in the D2A dataset with ML ensembles and C-BERT\",\"authors\":\"Saurabh Pujar, Yunhui Zheng, Luca Buratti, Burn Lewis, Yunchung Chen, Jim Laredo, Alessandro Morari, Edward Epstein, Tsungnan Lin, Bo Yang, Zhong Su\",\"doi\":\"10.1007/s10664-023-10405-9\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Static analysis tools are widely used for vulnerability detection as they can analyze programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to learn from programming language data opens new possibilities of reducing false positives when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, limited size, and synthetic and unrealistic source code. We propose Differential Dataset Analysis or D2A, a differential analysis based approach to label issues reported by static analysis tools. The dataset built with this approach is called the D2A dataset. The D2A dataset is built by analyzing version pairs from multiple open source projects. From each project, we select bug fixing commits and we run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that got fixed by the commit. We use D2A to generate a large labeled dataset. We then train both classic machine learning models and deep learning models for vulnerability identification using the D2A dataset. We show that the dataset can be used to build a classifier to identify possible false alarms among the issues reported by static analysis, hence helping developers prioritize and investigate potential true positives first. To facilitate future research and contribute to the community, we make the dataset generation pipeline and the dataset publicly available. We have also created a leaderboard based on the D2A dataset, which has already attracted attention and participation from the community.</p>\",\"PeriodicalId\":11525,\"journal\":{\"name\":\"Empirical Software Engineering\",\"volume\":\"4 1\",\"pages\":\"\"},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2024-02-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Empirical Software Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s10664-023-10405-9\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Empirical Software Engineering","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10664-023-10405-9","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

摘要

静态分析工具可分析具有复杂行为和数百万行代码的程序，因此被广泛用于漏洞检测。尽管静态分析工具很受欢迎，但众所周知，它们会产生过多的误报。最近，机器学习模型能够从编程语言数据中学习，这为静态分析减少误报提供了新的可能性。然而，用于训练漏洞识别模型的现有数据集存在多种局限性，如有限的错误上下文、有限的大小以及合成和不真实的源代码。我们提出了差分数据集分析或 D2A，这是一种基于差分分析的方法，用于标记静态分析工具报告的问题。使用这种方法建立的数据集称为 D2A 数据集。D2A 数据集是通过分析多个开源项目的版本对建立的。我们从每个项目中选择错误修复提交，并对这些提交前后的版本运行静态分析。如果在提交前版本中检测到的某些问题在相应的提交后版本中消失了，那么它们很有可能是提交后得到修复的真正错误。我们使用 D2A 生成一个大型标签数据集。然后，我们使用 D2A 数据集训练用于漏洞识别的经典机器学习模型和深度学习模型。我们的研究表明，该数据集可用于构建分类器，以识别静态分析报告的问题中可能存在的误报，从而帮助开发人员确定优先级并首先调查潜在的真阳性问题。为了促进未来研究并为社区做出贡献，我们公开了数据集生成管道和数据集。我们还基于 D2A 数据集创建了一个排行榜，该排行榜已经吸引了社区的关注和参与。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Analyzing source code vulnerabilities in the D2A dataset with ML ensembles and C-BERT

查看原文本刊更多论文

Analyzing source code vulnerabilities in the D2A dataset with ML ensembles and C-BERT

Static analysis tools are widely used for vulnerability detection as they can analyze programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to learn from programming language data opens new possibilities of reducing false positives when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, limited size, and synthetic and unrealistic source code. We propose Differential Dataset Analysis or D2A, a differential analysis based approach to label issues reported by static analysis tools. The dataset built with this approach is called the D2A dataset. The D2A dataset is built by analyzing version pairs from multiple open source projects. From each project, we select bug fixing commits and we run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that got fixed by the commit. We use D2A to generate a large labeled dataset. We then train both classic machine learning models and deep learning models for vulnerability identification using the D2A dataset. We show that the dataset can be used to build a classifier to identify possible false alarms among the issues reported by static analysis, hence helping developers prioritize and investigate potential true positives first. To facilitate future research and contribute to the community, we make the dataset generation pipeline and the dataset publicly available. We have also created a leaderboard based on the D2A dataset, which has already attracted attention and participation from the community.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Empirical Software Engineering 工程技术-计算机：软件工程

CiteScore

8.50

自引率

12.20%

发文量

169

审稿时长

>12 weeks

期刊介绍： Empirical Software Engineering provides a forum for applied software engineering research with a strong empirical component, and a venue for publishing empirical results relevant to both researchers and practitioners. Empirical studies presented here usually involve the collection and analysis of data and experience that can be used to characterize, evaluate and reveal relationships between software development deliverables, practices, and technologies. Over time, it is expected that such empirical results will form a body of knowledge leading to widely accepted and well-formed theories. The journal also offers industrial experience reports detailing the application of software technologies - processes, methods, or tools - and their effectiveness in industrial settings. Empirical Software Engineering promotes the publication of industry-relevant research, to address the significant gap between research and practice.