Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories

IF 6.2 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING

ACM Transactions on Software Engineering and Methodology Pub Date : 2024-03-04 DOI:10.1145/3649590

Daan Hommersom, Antonino Sabetta, Bonaventura Coppola, Dario Di Nucci, Damian A. Tamburri


Citations: 0

Abstract

The lack of comprehensive sources of accurate vulnerability data represents a critical obstacle to studying and understanding software vulnerabilities (and their corrections). In this paper, we present an approach that combines heuristics stemming from practical experience with machine learning (ML), specifically natural language processing (NLP), to address this problem. Our method consists of three phases. First, we construct an advisory record object containing key information about a vulnerability, extracted from an advisory such as those found in the National Vulnerability Database (NVD). These advisories are expressed in natural language. Second, using heuristics, we obtain a subset of candidate fix commits from the source code repository of the affected project by filtering out commits that can be identified as unrelated to the vulnerability at hand. Finally, for each of the remaining candidate commits, our method builds a numerical feature vector reflecting the characteristics of the commit that are relevant to predicting its match with the advisory at hand. Based on the values of these feature vectors, our method produces a ranked list of candidate fix commits. The score attributed by the ML model to each feature is kept visible to users, allowing them to easily interpret the predictions.
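The three phases can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names, the three features, the date-window filter, and the hand-set weights are all assumptions chosen for illustration, and a fixed linear score stands in for the trained ML model. It keeps the per-feature contributions alongside each score, in the spirit of the interpretability described above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AdvisoryRecord:
    """Phase 1: key information extracted from an advisory (hypothetical fields)."""
    vuln_id: str
    description: str
    published: int  # day number on an arbitrary timeline, to keep the sketch simple
    keywords: set = field(default_factory=set)


@dataclass
class Commit:
    sha: str
    message: str
    day: int
    changed_files: list


def tokenize(text):
    # Lowercased identifier-like tokens; a crude stand-in for real NLP.
    return set(re.findall(r"[a-z_][a-z0-9_]+", text.lower()))


def build_record(vuln_id, description, published):
    # Keep tokens longer than three characters as advisory keywords.
    keywords = {t for t in tokenize(description) if len(t) > 3}
    return AdvisoryRecord(vuln_id, description, published, keywords)


def filter_candidates(record, commits, window=180):
    # Phase 2: discard commits that are clearly unrelated. The only heuristic
    # here (an assumption) is a date window around the advisory publication.
    return [c for c in commits if abs(c.day - record.published) <= window]


def features(record, commit):
    # Phase 3: numerical feature vector for one (advisory, commit) pair.
    message = commit.message.lower()
    paths = " ".join(commit.changed_files).lower()
    overlap = len(record.keywords & tokenize(message))
    return {
        "mentions_vuln_id": float(record.vuln_id.lower() in message),
        "keyword_overlap": overlap / max(len(record.keywords), 1),
        "keyword_in_path": float(any(k in paths for k in record.keywords)),
    }


# Hand-set weights standing in for a trained model (assumption).
WEIGHTS = {"mentions_vuln_id": 3.0, "keyword_overlap": 2.0, "keyword_in_path": 1.0}


def rank(record, commits):
    # Returns (score, sha, per-feature contributions), best candidate first.
    scored = []
    for commit in filter_candidates(record, commits):
        values = features(record, commit)
        contributions = {name: WEIGHTS[name] * v for name, v in values.items()}
        scored.append((sum(contributions.values()), commit.sha, contributions))
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Calling `rank(build_record(...), commits)` returns the surviving candidates ordered by score, each with a dictionary showing how much every feature contributed to it.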

We implemented our approach and evaluated it on an open, manually curated data set comprising 2,391 known fix commits corresponding to 1,248 public vulnerability advisories. When considering the top 10 commits in the ranked results, our implementation successfully identified at least one fix commit for up to 84.03% of the vulnerabilities (with a fix commit in the first position for 65.06% of the vulnerabilities). Our evaluation shows that our method can considerably reduce the manual effort needed to search OSS repositories for the commits that fix known vulnerabilities.
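The top-10 and first-position figures above are instances of a "hit within the top k" measure. A minimal sketch of how such a measure could be computed follows; the function name and the input layout (dictionaries keyed by advisory identifier) are assumptions, not the authors' evaluation code.

```python
def recall_at_k(rankings, ground_truth, k):
    """Fraction of advisories with at least one known fix commit in the top k.

    rankings:     advisory id -> list of commit ids, best candidate first
    ground_truth: advisory id -> set of known fix commit ids
    """
    if not rankings:
        return 0.0
    hits = sum(
        1
        for vuln_id, ranked in rankings.items()
        if set(ranked[:k]) & ground_truth.get(vuln_id, set())
    )
    return hits / len(rankings)
```

With `k=1` the function measures how often a fix commit is ranked first; with `k=10` it measures how often at least one fix commit appears among the ten highest-ranked candidates.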




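Source journal

ACM Transactions on Software Engineering and Methodology (Engineering & Technology - Computer Science: Software Engineering)

CiteScore: 6.30

Self-citation rate: 4.50%

Articles published: 164

Review time: >12 weeks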

About the journal: Designing and building a large, complex software system is a tremendous challenge. ACM Transactions on Software Engineering and Methodology (TOSEM) publishes papers on all aspects of that challenge: specification, design, development and maintenance. It covers tools and methodologies, languages, data structures, and algorithms. TOSEM also reports on successful efforts, noting practical lessons that can be scaled and transferred to other projects, and often looks at applications of innovative technologies. The tone is scholarly but readable; the content is worthy of study; the presentation is effective.