Fine-grained Commit-level Vulnerability Type Prediction by CWE Tree Structure

2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) Pub Date : 2023-05-01 DOI:10.1109/ICSE48619.2023.00088

Shengyi Pan, Lingfeng Bao, Xin Xia, David Lo, Shanping Li


Citation count: 3

Abstract

Identifying security patches via code commits to allow early warnings and timely fixes for Open Source Software (OSS) has received increasing attention. However, the existing detection methods can only identify the presence of a patch (i.e., a binary classification) but fail to pinpoint the vulnerability type. In this work, we take the first step to categorize the security patches into fine-grained vulnerability types. Specifically, we use the Common Weakness Enumeration (CWE) as the label and perform fine-grained classification using categories at the third level of the CWE tree. We first formulate the task as a Hierarchical Multi-label Classification (HMC) problem, i.e., inferring a path (a sequence of CWE nodes) from the root of the CWE tree to the node at the target depth. We then propose an approach named TreeVul with a hierarchical and chained architecture, which utilizes the structure information of the CWE tree as prior knowledge for the classification task. We further propose a tree-structure-aware, beam-search-based inference algorithm for retrieving the optimal path with the highest merged probability. We collect a large security patch dataset from NVD, consisting of 6,541 commits from 1,560 GitHub OSS repositories. Experimental results show that TreeVul significantly outperforms the best-performing baselines, with improvements of 5.9%, 25.0%, and 7.7% in terms of weighted F1-score, macro F1-score, and MCC, respectively. We further conduct a user study and a case study to verify the practical value of TreeVul in enriching the binary patch detection results and improving the data quality of NVD, respectively.
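The path-inference idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the tree, the node names, the per-level probabilities, and the function `beam_search` are all hypothetical, and the "merged probability" of a path is assumed here to be the product of the per-level probabilities along it. The sketch only shows why a tree-constrained beam search can recover a better root-to-depth-3 path than a greedy, level-by-level choice.

```python
import math

# Hypothetical three-level label tree (placeholder names, not real CWE IDs).
CHILDREN = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1"],
    "A1": ["A1x", "A1y"],
    "A2": ["A2x"],
    "B1": ["B1x", "B1y"],
}

# Hypothetical classifier outputs: one probability table per tree level.
LEVEL_PROBS = [
    {"A": 0.55, "B": 0.45},
    {"A1": 0.30, "A2": 0.25, "B1": 0.45},
    {"A1x": 0.10, "A1y": 0.15, "A2x": 0.05, "B1x": 0.60, "B1y": 0.10},
]


def beam_search(children, level_probs, beam_width=2):
    """Return up to beam_width (probability, path) pairs, best first.

    Assumption: a path's merged probability is the product of its
    per-level probabilities, accumulated here as a sum of logs.
    """
    beams = [(0.0, ["root"])]  # (log-probability, path so far)
    for probs in level_probs:
        candidates = []
        for score, path in beams:
            # Tree awareness: only children of the current node are valid.
            for child in children.get(path[-1], []):
                p = probs.get(child, 0.0)
                if p > 0.0:
                    candidates.append((score + math.log(p), path + [child]))
        # Keep only the beam_width highest-scoring partial paths.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return [(math.exp(score), path[1:]) for score, path in beams]


if __name__ == "__main__":
    print(beam_search(CHILDREN, LEVEL_PROBS, beam_width=1))  # greedy choice
    print(beam_search(CHILDREN, LEVEL_PROBS, beam_width=2))  # wider beam
```

With these toy numbers, a beam width of 1 commits to `A` at the first level and ends at `A/A1/A1y`, while a beam width of 2 keeps `B` alive long enough to find `B/B1/B1x`, whose product of probabilities is higher.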


