Predicting Security Vulnerabilities using Source Code Metrics

2021 Swedish Workshop on Data Science (SweDS). Pub Date: 2021-12-02. DOI: 10.1109/SweDS53855.2021.9638301

Sundarakrishnan Ganesh, Tobias Ohlsson, Francis Palma

Citations: 0

Abstract

Large open-source systems generate and operate on a plethora of sensitive enterprise data. Thus, security threats or vulnerabilities must not be present in open-source systems and must be resolved as early as possible in the development phases to avoid catastrophic consequences. One way to recognize security vulnerabilities is to predict them while developers write code, which minimizes cost and resources. This study examines the effectiveness of machine learning algorithms in predicting potential security vulnerabilities by analyzing the source code of a system. We obtained the security vulnerabilities dataset from Apache Tomcat security reports for versions 4.x to 10.x. We also collected the source code of Apache Tomcat 4.x to 10.x to compute 43 object-oriented metrics. We assessed four traditional supervised learning algorithms, i.e., Naive Bayes (NB), Decision Tree (DT), K-Nearest Neighbors (KNN), and Logistic Regression (LR), to understand their efficacy in predicting security vulnerabilities. KNN achieved the highest accuracy, 80.6%, making it the most effective of all the models we built. The DT classifier also performed well, except on multi-class classification, where it under-performed.
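To make the idea of classifying code by its metrics concrete, here is a minimal sketch of a K-Nearest Neighbors classifier over metric vectors. It is not the authors' implementation: the three metrics, their values, the labels, the min-max scaling step, the Euclidean distance, and k = 3 are all assumptions chosen for illustration, whereas the study computed 43 object-oriented metrics from Apache Tomcat.

```python
import math
from collections import Counter


def min_max_scale(rows):
    """Rescale each metric column to [0, 1] so no single metric dominates the distance."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # Guard against constant columns, whose range would be zero.
    spans = [(max(c) - lo) or 1.0 for c, lo in zip(cols, lows)]
    scaled = [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]
    return scaled, lows, spans


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical per-class metric vectors:
# (lines of code, coupling between objects, cyclomatic complexity)
train_metrics = [
    [120, 3, 8], [90, 2, 5], [150, 4, 10],          # labelled not vulnerable (0)
    [900, 15, 60], [1100, 18, 75], [800, 12, 55],   # labelled vulnerable (1)
]
train_labels = [0, 0, 0, 1, 1, 1]

scaled_train, lows, spans = min_max_scale(train_metrics)


def classify(metrics, k=3):
    """Scale a new metric vector with the training ranges, then vote with KNN."""
    scaled = [(v - lo) / s for v, lo, s in zip(metrics, lows, spans)]
    return knn_predict(scaled_train, train_labels, scaled, k)


print(classify([950, 14, 58]))  # a large, highly coupled class
print(classify([100, 2, 6]))    # a small, simple class
```

Scaling matters here because KNN relies on raw distances: without it, a metric with a large numeric range such as lines of code would outweigh the others. A comparison against NB, DT, and LR like the one reported in the abstract would additionally need a held-out test set and an accuracy measure.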
