M. O. Adebiyi, Mary O. Ajayi, F. Osang, A. Adebiyi
A comparative study of selected machine learning algorithms for cyber threat detection in open source data
2023 International Conference on Science, Engineering and Business for Sustainable Development Goals (SEB-SDG), published 2023-04-05
DOI: 10.1109/SEB-SDG57117.2023.10124615 (https://doi.org/10.1109/SEB-SDG57117.2023.10124615)
Citations: 0
Abstract
Threat actors continually develop and evolve new tools to exploit loopholes and vulnerabilities in security systems. These malicious actors frequently use open sources to exchange the Tactics, Techniques, and Procedures (TTPs) they use to attack devices. Cybersecurity professionals therefore have a huge amount of threat data available from these open sources, which makes the data difficult to utilize and share. Humans can easily distinguish useful and relevant information, but doing so is daunting when the data is large and time is limited, hence the need to automate the process. The objective of this research is to carry out a comparative analysis of the performance of four machine learning algorithms (Decision Tree, Logistic Regression, Random Forest and Naïve Bayes) to help cybersecurity professionals decide on the most suitable algorithm for analyzing a cyber threat intelligence dataset. The dataset used in this work contains 48,000 objects. The evaluation metrics used for the comparison are accuracy, precision, recall and F1 score. The experimental results showed that the Random Forest algorithm performed best, with an accuracy of 97.16%, followed by the Decision Tree algorithm with an accuracy of 97.08% and the Naïve Bayes classifier with an accuracy of 93.92%, while the Logistic Regression classifier scored lowest of the four algorithms with an accuracy of 80.15%. Prospective researchers can build on these findings to develop newer and enhanced algorithms that can support decision making by cybersecurity experts.
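The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes a binary labelling (threat-relevant versus irrelevant), which the abstract does not state, and the names Metrics, binary_metrics and rank_by_accuracy are hypothetical helpers introduced here for illustration, not the authors' code. The toy labels are invented; only the accuracy figures in REPORTED_ACCURACY come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one label is required")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return Metrics(accuracy, precision, recall, f1)


# Accuracy values reported in the abstract, in percent.
REPORTED_ACCURACY = {
    "Random Forest": 97.16,
    "Decision Tree": 97.08,
    "Naive Bayes": 93.92,
    "Logistic Regression": 80.15,
}


def rank_by_accuracy(scores):
    """Return model names sorted from highest to lowest accuracy."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical toy labels (1 = threat-relevant, 0 = irrelevant), not the paper's data.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(binary_metrics(y_true, y_pred))
    print(rank_by_accuracy(REPORTED_ACCURACY))
```

In practice the same quantities are available from scikit-learn (accuracy_score, precision_score, recall_score and f1_score in sklearn.metrics). The hand-written version is shown only to make the definitions explicit. Ranking by accuracy alone mirrors the ordering given in the abstract, but on imbalanced threat data precision and recall may lead to a different choice.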