{"title":"A cognitive platform for collecting cyber threat intelligence and real-time detection using cloud computing","authors":"Prasasthy Balasubramanian, Sadaf Nazari, Danial Khosh Kholgh, Alireza Mahmoodi, Justin Seby, Panos Kostakos","doi":"10.1016/j.dajour.2025.100545","DOIUrl":null,"url":null,"abstract":"<div><div>The extraction of cyber threat intelligence (CTI) from open sources is a rapidly expanding defensive strategy that enhances the resilience of both Information Technology (IT) and Operational Technology (OT) environments against large-scale cyber-attacks. However, for most organizations, collecting actionable CTI remains both a technical bottleneck and a black box. While previous research has focused on improving individual components of the extraction process, the community lacks open-source platforms for deploying streaming CTI data pipelines in the wild. This study proposes an efficient platform capable of processing compute-intensive data pipelines, based on cloud computing, for real-time detection, collection, and sharing of CTI from various online sources. We developed a prototype platform (TSTEM) with a containerized microservice architecture that uses Tweepy, Scrapy, Terraform, Elasticsearch, Logstash, and Kibana (ELK), Kafka, and Machine Learning Operations (MLOps) to autonomously search, extract, and index indicators of compromise (IOCs) in the wild. Moreover, the provisioning, monitoring, and management of the platform are achieved through infrastructure as code (IaC). Custom focus-crawlers collect web content, processed by a first-level classifier to identify potential IOCs. Relevant content advances to a second level for further examination. State-of-the-art natural language processing (NLP) models are used for classification and entity extraction, enhancing the IOC extraction methodology. Our results indicate these models exhibit high accuracy (exceeding 98%) in classification and extraction tasks, achieving this performance within less than a minute. The system’s effectiveness is due to a finely-tuned IOC extraction method that operates at multiple stages, ensuring precise identification with low false positives.</div></div>","PeriodicalId":100357,"journal":{"name":"Decision Analytics Journal","volume":"14 ","pages":"Article 100545"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Decision Analytics Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2772662225000013","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
The extraction of cyber threat intelligence (CTI) from open sources is a rapidly expanding defensive strategy that enhances the resilience of both Information Technology (IT) and Operational Technology (OT) environments against large-scale cyber-attacks. However, for most organizations, collecting actionable CTI remains both a technical bottleneck and a black box. While previous research has focused on improving individual components of the extraction process, the community lacks open-source platforms for deploying streaming CTI data pipelines in the wild. This study proposes an efficient cloud-based platform capable of processing compute-intensive data pipelines for the real-time detection, collection, and sharing of CTI from various online sources. We developed a prototype platform (TSTEM) with a containerized microservice architecture that uses Tweepy, Scrapy, Terraform, the ELK stack (Elasticsearch, Logstash, and Kibana), Kafka, and Machine Learning Operations (MLOps) to autonomously search, extract, and index indicators of compromise (IOCs) in the wild. Provisioning, monitoring, and management of the platform are handled through infrastructure as code (IaC). Custom focused crawlers collect web content, which a first-level classifier processes to flag potential IOCs; relevant content then advances to a second level for further examination. State-of-the-art natural language processing (NLP) models perform the classification and entity extraction, strengthening the IOC extraction methodology. Our results indicate that these models achieve high accuracy (exceeding 98%) on classification and extraction tasks in under a minute. The system's effectiveness stems from a finely tuned, multi-stage IOC extraction method that ensures precise identification with few false positives.
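The collection layer described above couples custom focused crawlers with Kafka streaming. Below is a minimal sketch of what such a component could look like in Python, pairing Scrapy with a kafka-python producer; the seed URL, the topic name "cti-raw-content", and the broker address are illustrative assumptions, not details taken from the paper.

import json

import scrapy
from kafka import KafkaProducer


class FocusedCrawler(scrapy.Spider):
    # Minimal focused crawler: scrape page text and stream it to Kafka
    # for downstream first-level classification.
    name = "cti_focused_crawler"
    start_urls = ["https://example.com/threat-reports"]  # hypothetical seed

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Assumes a locally reachable Kafka broker.
        self.producer = KafkaProducer(
            bootstrap_servers="localhost:9092",
            value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        )

    def parse(self, response):
        # Publish the raw page text to a (hypothetical) raw-content topic.
        self.producer.send(
            "cti-raw-content",
            {"url": response.url, "text": " ".join(response.css("p::text").getall())},
        )
        # Follow outgoing links so the crawl keeps expanding from the seeds.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Saved as focused_crawler.py, this runs with, for example, scrapy runspider focused_crawler.py.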
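The two-level IOC identification outlined in the abstract, a relevance classifier followed by entity extraction, could be prototyped with off-the-shelf Hugging Face pipelines plus regex patterns for concrete indicators such as IP addresses and file hashes. The model IDs below are publicly available stand-ins; the paper fine-tunes its own models, which this sketch does not reproduce.

import re

from transformers import pipeline

# Stage 1: binary relevance filter. Stage 2: named-entity extraction.
# Both model IDs are public stand-ins, not the models trained in the paper.
relevance_clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
entity_ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# Regex patterns for indicator types a generic NER model will miss.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "hash": re.compile(r"\b[a-fA-F0-9]{32,64}\b"),  # MD5/SHA-1/SHA-256 lengths
}


def extract_iocs(text: str) -> dict:
    # Stage 1: drop documents the classifier deems irrelevant. The label set
    # here belongs to the stand-in model; a real deployment would use a
    # classifier trained on CTI relevance. Truncate crudely to stay within
    # the model's input limit.
    verdict = relevance_clf(text[:512])[0]
    if verdict["label"] == "NEGATIVE":
        return {}
    # Stage 2: run entity extraction and pattern matching on survivors only.
    return {
        "entities": entity_ner(text),
        "patterns": {name: rx.findall(text) for name, rx in IOC_PATTERNS.items()},
    }

Gating the expensive NER pass behind a cheap first-level classifier mirrors the multi-stage design the abstract credits for the system's low false-positive rate.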
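Downstream, the abstract describes indexing extracted IOCs with the ELK stack. A consumer bridging Kafka and Elasticsearch might look like the following sketch; it assumes the extract_iocs function from the previous sketch is saved as ioc_extractor.py, and the topic, broker, and index names remain placeholders.

import json

from elasticsearch import Elasticsearch
from kafka import KafkaConsumer

from ioc_extractor import extract_iocs  # the two-stage sketch above

es = Elasticsearch("http://localhost:9200")  # assumed local ELK deployment
consumer = KafkaConsumer(
    "cti-raw-content",  # hypothetical topic fed by the crawler sketch
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    doc = message.value
    iocs = extract_iocs(doc["text"])
    if iocs:
        # Index hits so Kibana dashboards can query and visualize them.
        es.index(index="cti-iocs", document={"url": doc["url"], "iocs": iocs})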