{"title":"A cognitive platform for collecting cyber threat intelligence and real-time detection using cloud computing","authors":"Prasasthy Balasubramanian, Sadaf Nazari, Danial Khosh Kholgh, Alireza Mahmoodi, Justin Seby, Panos Kostakos","doi":"10.1016/j.dajour.2025.100545","DOIUrl":null,"url":null,"abstract":"<div><div>The extraction of cyber threat intelligence (CTI) from open sources is a rapidly expanding defensive strategy that enhances the resilience of both Information Technology (IT) and Operational Technology (OT) environments against large-scale cyber-attacks. However, for most organizations, collecting actionable CTI remains both a technical bottleneck and a black box. While previous research has focused on improving individual components of the extraction process, the community lacks open-source platforms for deploying streaming CTI data pipelines in the wild. This study proposes an efficient platform capable of processing compute-intensive data pipelines, based on cloud computing, for real-time detection, collection, and sharing of CTI from various online sources. We developed a prototype platform (TSTEM) with a containerized microservice architecture that uses Tweepy, Scrapy, Terraform, Elasticsearch, Logstash, and Kibana (ELK), Kafka, and Machine Learning Operations (MLOps) to autonomously search, extract, and index indicators of compromise (IOCs) in the wild. Moreover, the provisioning, monitoring, and management of the platform are achieved through infrastructure as code (IaC). Custom focus-crawlers collect web content, processed by a first-level classifier to identify potential IOCs. Relevant content advances to a second level for further examination. State-of-the-art natural language processing (NLP) models are used for classification and entity extraction, enhancing the IOC extraction methodology. Our results indicate these models exhibit high accuracy (exceeding 98%) in classification and extraction tasks, achieving this performance within less than a minute. The system’s effectiveness is due to a finely-tuned IOC extraction method that operates at multiple stages, ensuring precise identification with low false positives.</div></div>","PeriodicalId":100357,"journal":{"name":"Decision Analytics Journal","volume":"14 ","pages":"Article 100545"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Decision Analytics Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2772662225000013","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
The extraction of cyber threat intelligence (CTI) from open sources is a rapidly expanding defensive strategy that enhances the resilience of both Information Technology (IT) and Operational Technology (OT) environments against large-scale cyber-attacks. However, for most organizations, collecting actionable CTI remains both a technical bottleneck and a black box. While previous research has focused on improving individual components of the extraction process, the community lacks open-source platforms for deploying streaming CTI data pipelines in the wild. This study proposes an efficient cloud-based platform capable of processing compute-intensive data pipelines for the real-time detection, collection, and sharing of CTI from various online sources. We developed a prototype platform (TSTEM) with a containerized microservice architecture that uses Tweepy, Scrapy, Terraform, the ELK stack (Elasticsearch, Logstash, and Kibana), Kafka, and Machine Learning Operations (MLOps) to autonomously search, extract, and index indicators of compromise (IOCs) in the wild. Provisioning, monitoring, and management of the platform are handled through infrastructure as code (IaC). Custom focused crawlers collect web content, which a first-level classifier processes to flag potential IOCs; relevant content then advances to a second level for further examination. State-of-the-art natural language processing (NLP) models perform the classification and entity extraction, strengthening the IOC extraction methodology. Our results indicate that these models achieve high accuracy (exceeding 98%) on classification and extraction tasks in under a minute. The system's effectiveness stems from a finely tuned, multi-stage IOC extraction method that ensures precise identification with few false positives.
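The collection layer described above couples custom focused crawlers with Kafka streaming. Below is a minimal sketch of what such a component could look like in Python, pairing Scrapy with a kafka-python producer; the seed URL, the topic name "cti-raw-content", and the broker address are illustrative assumptions, not details taken from the paper.

import json

import scrapy
from kafka import KafkaProducer


class FocusedCrawler(scrapy.Spider):
    # Minimal focused crawler: scrape page text and stream it to Kafka
    # for downstream first-level classification.
    name = "cti_focused_crawler"
    start_urls = ["https://example.com/threat-reports"]  # hypothetical seed

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Assumes a locally reachable Kafka broker.
        self.producer = KafkaProducer(
            bootstrap_servers="localhost:9092",
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )

    def parse(self, response):
        # Publish the raw page text to a (hypothetical) raw-content topic.
        self.producer.send(
            "cti-raw-content",
            {"url": response.url, "text": " ".join(response.css("p::text").getall())},
        )
        # Follow outgoing links so the crawl keeps expanding from the seeds.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Saved as focused_crawler.py, this runs with, for example, scrapy runspider focused_crawler.py.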
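The two-level IOC identification outlined in the abstract, a relevance classifier followed by entity extraction, could be prototyped with off-the-shelf Hugging Face pipelines plus regex patterns for concrete indicators such as IP addresses and file hashes. The model IDs below are publicly available stand-ins; the paper fine-tunes its own models, which this sketch does not reproduce.

import re

from transformers import pipeline

# Stage 1: binary relevance filter. Stage 2: named-entity extraction.
# Both model IDs are public stand-ins, not the models trained in the paper.
relevance_clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
entity_ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# Regex patterns for indicator types a generic NER model will miss.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "hash": re.compile(r"\b[a-fA-F0-9]{32,64}\b"),  # MD5/SHA-1/SHA-256 lengths
}


def extract_iocs(text: str) -> dict:
    # Stage 1: drop documents the classifier deems irrelevant. The label set
    # here belongs to the stand-in model; a real deployment would use a
    # classifier trained on CTI relevance. Truncate crudely to stay within
    # the model's input limit.
    verdict = relevance_clf(text[:512])[0]
    if verdict["label"] == "NEGATIVE":
        return {}
    # Stage 2: run entity extraction and pattern matching on survivors only.
    return {
        "entities": entity_ner(text),
        "patterns": {name: rx.findall(text) for name, rx in IOC_PATTERNS.items()},
    }

Gating the expensive NER pass behind a cheap first-level classifier mirrors the multi-stage design the abstract credits for the system's low false-positive rate.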
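Downstream, the abstract describes indexing extracted IOCs with the ELK stack. A consumer bridging Kafka and Elasticsearch might look like the following sketch; it assumes the extract_iocs function from the previous sketch is saved as ioc_extractor.py, and the topic, broker, and index names remain placeholders.

import json

from elasticsearch import Elasticsearch
from kafka import KafkaConsumer

from ioc_extractor import extract_iocs  # the two-stage sketch above

es = Elasticsearch("http://localhost:9200")  # assumed local ELK deployment
consumer = KafkaConsumer(
    "cti-raw-content",  # hypothetical topic fed by the crawler sketch
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    doc = message.value
    iocs = extract_iocs(doc["text"])
    if iocs:
        # Index hits so Kibana dashboards can query and visualize them.
        es.index(index="cti-iocs", document={"url": doc["url"], "iocs": iocs})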