An Empirical Study of High-Impact Factors for Machine Learning-Based Vulnerability Detection

2020 IEEE 2nd International Workshop on Intelligent Bug Fixing (IBF). Pub Date: 2020-02-01. DOI: 10.1109/IBF50092.2020.9034888

Wei Zheng, Jialiang Gao, Xiaoxue Wu, Yuxing Xun, Guoliang Liu, Xiang Chen


Citations: 4

Abstract

Vulnerability detection is an important topic in software engineering. To improve its effectiveness and efficiency, many traditional machine learning-based and deep learning-based vulnerability detection methods have been proposed. However, the impact of different factors on vulnerability detection remains unknown. For example, classification models and vectorization methods can directly affect detection results, and code replacement can affect the features used for detection. We conduct a comparative study to evaluate the impact of different classification algorithms, vectorization methods, and the replacement of user-defined variable and function names. We collected three vulnerability code datasets, which correspond to different types of vulnerabilities and have different proportions of source code. In addition, we extract and analyze the features of these datasets to explain some of the experimental results. Our findings can be summarized as follows: (i) Deep learning performs better than traditional machine learning, and BLSTM achieves the best performance. (ii) CountVectorizer can improve the performance of traditional machine learning. (iii) Different vulnerability types and different code sources generate different features. We use the Random Forest algorithm to generate the features of the vulnerability code datasets; these features include system-related functions, syntax keywords, and user-defined names. (iv) Datasets without replacement of user-defined variable and function names achieve better vulnerability detection results.
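To make two of the studied factors concrete, the following is a minimal Python sketch of user-defined name replacement followed by count-based vectorization. It is an illustration only, not the authors' pipeline: the abstract does not specify the replacement scheme, so the `FUN1`/`VAR1` placeholder names, the tiny keyword and system-function lists, the regex tokenizer, and the rule that an identifier followed by `(` is a function are all assumptions. The `count_vector` helper mimics the idea behind a count vectorizer using only the standard library; it is not scikit-learn's `CountVectorizer`.

```python
import re
from collections import Counter

# Assumed, deliberately tiny vocabularies; a real study would use the full
# C keyword list and a curated list of library/system functions.
C_KEYWORDS = {"int", "char", "void", "if", "else", "for", "while", "return", "sizeof"}
SYSTEM_FUNCTIONS = {"strcpy", "memcpy", "malloc", "free", "printf", "strlen"}

# Identifiers, integer literals, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def tokenize(code):
    """Split source text into identifiers, integer literals, and single symbols."""
    return TOKEN_RE.findall(code)


def replace_user_names(tokens):
    """Map user-defined functions to FUN1, FUN2, ... and variables to VAR1, VAR2, ..."""
    mapping = {}
    counts = {"FUN": 0, "VAR": 0}
    out = []
    for i, tok in enumerate(tokens):
        is_identifier = IDENT_RE.fullmatch(tok) is not None
        if not is_identifier or tok in C_KEYWORDS or tok in SYSTEM_FUNCTIONS:
            # Keywords, system functions, literals, and symbols are kept as-is.
            out.append(tok)
            continue
        if tok not in mapping:
            # Heuristic (assumption): an identifier directly followed by "("
            # on first sight is treated as a function name.
            is_call = i + 1 < len(tokens) and tokens[i + 1] == "("
            kind = "FUN" if is_call else "VAR"
            counts[kind] += 1
            mapping[tok] = f"{kind}{counts[kind]}"
        out.append(mapping[tok])
    return out


def count_vector(tokens, vocabulary):
    """Return token counts in a fixed vocabulary order (a bag-of-tokens vector)."""
    freq = Counter(tokens)
    return [freq[term] for term in vocabulary]


# Hypothetical snippet used only for illustration.
snippet = "void copy_input(char *user_buf) { char dst[8]; strcpy(dst, user_buf); }"
raw_tokens = tokenize(snippet)
replaced_tokens = replace_user_names(raw_tokens)

vocabulary = ["strcpy", "char", "VAR1", "FUN1"]
print(" ".join(replaced_tokens))
print(count_vector(replaced_tokens, vocabulary))
```

After replacement, names such as `copy_input` and `user_buf` no longer appear as distinct tokens, so a classifier cannot use them as features. That is one plausible reading of why finding (iv) reports better results when the original names are kept, given that finding (iii) lists user-defined names among the generated features.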


