PatchDB:大规模安全补丁数据集

2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) Pub Date : 2021-06-01 DOI:10.1109/DSN48987.2021.00030

Xinda Wang, Shu Wang, Pengbin Feng, Kun Sun, S. Jajodia

{"title":"PatchDB:大规模安全补丁数据集","authors":"Xinda Wang, Shu Wang, Pengbin Feng, Kun Sun, S. Jajodia","doi":"10.1109/DSN48987.2021.00030","DOIUrl":null,"url":null,"abstract":"Security patches, embedding both vulnerable code and the corresponding fixes, are of great significance to vulnerability detection and software maintenance. However, the existing patch datasets suffer from insufficient samples and low varieties. In this paper, we construct a large-scale patch dataset called PatchDB that consists of three components, namely, NVD-based dataset, wild-based dataset, and synthetic dataset. The NVD-based dataset is extracted from the patch hyperlinks indexed by the NVD. The wild-based dataset includes security patches that we collect from the commits on GitHub. To improve the efficiency of data collection and reduce the effort on manual verification, we develop a new nearest link search method to help find the most promising security patch candidates. Moreover, we provide a synthetic dataset that uses a new oversampling method to synthesize patches at the source code level by enriching the control flow variants of original patches. We conduct a set of studies to investigate the effectiveness of the proposed algorithms and evaluate the properties of the collected dataset. The experimental results show that PatchDB can help improve the performance of security patch identification.","PeriodicalId":222512,"journal":{"name":"2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)","volume":"28 7","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":"{\"title\":\"PatchDB: A Large-Scale Security Patch Dataset\",\"authors\":\"Xinda Wang, Shu Wang, Pengbin Feng, Kun Sun, S. Jajodia\",\"doi\":\"10.1109/DSN48987.2021.00030\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Security patches, embedding both vulnerable code and the corresponding fixes, are of great significance to vulnerability detection and software maintenance. However, the existing patch datasets suffer from insufficient samples and low varieties. In this paper, we construct a large-scale patch dataset called PatchDB that consists of three components, namely, NVD-based dataset, wild-based dataset, and synthetic dataset. The NVD-based dataset is extracted from the patch hyperlinks indexed by the NVD. The wild-based dataset includes security patches that we collect from the commits on GitHub. To improve the efficiency of data collection and reduce the effort on manual verification, we develop a new nearest link search method to help find the most promising security patch candidates. Moreover, we provide a synthetic dataset that uses a new oversampling method to synthesize patches at the source code level by enriching the control flow variants of original patches. We conduct a set of studies to investigate the effectiveness of the proposed algorithms and evaluate the properties of the collected dataset. The experimental results show that PatchDB can help improve the performance of security patch identification.\",\"PeriodicalId\":222512,\"journal\":{\"name\":\"2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)\",\"volume\":\"28 7\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"21\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DSN48987.2021.00030\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSN48987.2021.00030","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 21

摘要

安全补丁既嵌入了漏洞代码，也嵌入了相应的修复程序，对漏洞检测和软件维护具有重要意义。然而，现有的斑块数据集存在样本不足、品种少的问题。本文构建了一个大规模补丁数据集PatchDB，该数据集由基于nvd的数据集、基于野生的数据集和合成数据集三部分组成。基于NVD的数据集是从由NVD索引的补丁超链接中提取的。基于野生的数据集包括我们从GitHub上的提交中收集的安全补丁。为了提高数据收集的效率和减少人工验证的工作量，我们开发了一种新的最近链接搜索方法来帮助找到最有希望的安全补丁候选。此外，我们提供了一个合成数据集，该数据集使用一种新的过采样方法，通过丰富原始补丁的控制流变体，在源代码级别合成补丁。我们进行了一系列研究来调查所提出算法的有效性，并评估所收集数据集的属性。实验结果表明，PatchDB有助于提高安全补丁识别的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

PatchDB: A Large-Scale Security Patch Dataset

Security patches, embedding both vulnerable code and the corresponding fixes, are of great significance to vulnerability detection and software maintenance. However, the existing patch datasets suffer from insufficient samples and low varieties. In this paper, we construct a large-scale patch dataset called PatchDB that consists of three components, namely, NVD-based dataset, wild-based dataset, and synthetic dataset. The NVD-based dataset is extracted from the patch hyperlinks indexed by the NVD. The wild-based dataset includes security patches that we collect from the commits on GitHub. To improve the efficiency of data collection and reduce the effort on manual verification, we develop a new nearest link search method to help find the most promising security patch candidates. Moreover, we provide a synthetic dataset that uses a new oversampling method to synthesize patches at the source code level by enriching the control flow variants of original patches. We conduct a set of studies to investigate the effectiveness of the proposed algorithms and evaluate the properties of the collected dataset. The experimental results show that PatchDB can help improve the performance of security patch identification.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

自引率

0.00%

发文量