Real-Time Automated Cyber Threat Classification and Emerging Threat Detection Framework
Authors: Alemayehu Tilahun Haile; Surafel Lemma Abebe; Henock Mulugeta Melaku
Journal: IEEE Open Journal of the Computer Society, vol. 6, pp. 921-930
Published: 2025-06-16
DOI: 10.1109/OJCS.2025.3580235
URL: https://ieeexplore.ieee.org/document/11037544/
Citations: 0
Abstract
Automating cyber threat intelligence (CTI) collection and analysis in real time is critical for the timely detection and mitigation of cyber threats. Cybersecurity researchers have recently recommended CTI as a proactive and robust method for automated cyber threat prediction. This automated solution collects and analyzes real-time data from social media, cybersecurity forums, and hacker forums, where cybersecurity analysts and hackers discuss cybersecurity-related topics, to discover potential threats. In this article, we propose a comprehensive framework that automates both cyber threat classification and emerging threat detection using real-time data from surface, deep, and dark web sources. We collected real-time data from hacker and security forums to construct binary and multiclass cyber threat classifications. A labeled leaked dataset served as ground truth for classification. Machine learning and deep learning techniques were used to perform the classification. Latent Dirichlet allocation (LDA) and nonnegative matrix factorization (NMF) were used to analyze topic distribution over time and identify emerging threats. This approach allows for the identification of zero-day attacks and other emerging threats by monitoring shifts in topics. A support vector machine with the bag-of-words (binary term weight) model achieved the highest accuracies, 93.67% and 96.35% for binary and multiclass classification, respectively. LDA and NMF were also used to extract the top topics from models fitted with various numbers of topics. The LDA model is well suited for identifying emerging trends and useful for real-time threat monitoring in cybersecurity.
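The "binary term weight" variant of bag-of-words mentioned above records only whether a term occurs in a post, not how often. The following is a minimal sketch of that encoding in plain Python. It is not the authors' pipeline: the tokenizer, the function names, and the two sample posts are hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters and digits (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(docs: List[str]) -> Dict[str, int]:
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def binary_bow(doc: str, vocab: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector: 1 if the term occurs at all, regardless of its count."""
    vec = [0] * len(vocab)
    for tok in tokenize(doc):
        idx = vocab.get(tok)
        if idx is not None:  # tokens outside the vocabulary are ignored
            vec[idx] = 1
    return vec


# Hypothetical forum posts, invented for this example.
posts = [
    "selling ransomware builder, ransomware source included",
    "new exploit for router firmware",
]
vocab = build_vocabulary(posts)
vectors = [binary_bow(p, vocab) for p in posts]
```

In the first post "ransomware" appears twice, yet its entry in the vector is 1; a count-based bag-of-words would store 2 instead. Vectors of this form are what a classifier such as a support vector machine would take as input.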