A comparative study of selected machine learning algorithms for cyber threat detection in open source data

2023 International Conference on Science, Engineering and Business for Sustainable Development Goals (SEB-SDG). Pub Date: 2023-04-05. DOI: 10.1109/SEB-SDG57117.2023.10124615

M. O. Adebiyi, Mary O. Ajayi, F. Osang, A. Adebiyi



Abstract

Threat actors continually develop and evolve new tools to quickly exploit loopholes and vulnerabilities in security systems. These malicious actors frequently use open sources to exchange the Tactics, Techniques, and Procedures (TTPs) they use to attack devices. The sheer volume of threat data available on these open sources makes it difficult for cybersecurity professionals to utilize and share. Humans can easily distinguish useful and relevant information, but doing so becomes daunting when the data is large and time is limited, hence the need to automate the process. The objective of this research is to carry out a comparative analysis of the performance of four machine learning algorithms (Decision Tree, Logistic Regression, Random Forest, and Naïve Bayes) to help cybersecurity professionals decide on the most suitable algorithm for analyzing cyber threat intelligence datasets. The dataset used in this work contains 48,000 objects. The evaluation metrics used for the comparison are accuracy, precision, recall, and F1 score. The experimental results showed that the Random Forest algorithm performed best, with an accuracy of 97.16%, followed by the Decision Tree algorithm at 97.08% and the Naïve Bayes classifier at 93.92%, while the Logistic Regression classifier scored lowest of the four, with an accuracy of 80.15%. Prospective researchers can build on these findings to develop newer and enhanced algorithms that support decision making by cybersecurity experts.
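The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes a binary labeling (1 = threat-relevant, 0 = not relevant), which the abstract does not state. The function name `binary_metrics` and the toy labels are hypothetical and are not taken from the authors' dataset or code.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels.

    Hypothetical helper for illustration only; not the authors' code.
    """
    pairs = list(zip(y_true, y_pred))
    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up toy labels (assumption): 4 relevant items, 6 irrelevant ones.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

scores = binary_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Accuracy alone can look strong on an imbalanced dataset, which is one reason to report precision, recall and F1 alongside it when comparing classifiers.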

