TSTEM: A Cognitive Platform for Collecting Cyber Threat Intelligence in the Wild

arXiv publication date: 2024-02-15. DOI: 10.48550/arXiv.2402.09973

Prasasthy Balasubramanian, Sadaf Nazari, Danial Khosh Kholgh, A. Mahmoodi, Justin Seby, Panos Kostakos



Abstract

The extraction of cyber threat intelligence (CTI) from open sources is a rapidly expanding defensive strategy that enhances the resilience of both Information Technology (IT) and Operational Technology (OT) environments against large-scale cyber-attacks. While previous research has focused on improving individual components of the extraction process, the community lacks open-source platforms for deploying streaming CTI data pipelines in the wild. To address this gap, this study describes the implementation of an efficient platform, built on the cloud computing paradigm, that can run compute-intensive data pipelines for real-time detection, collection, and sharing of CTI from different online sources. We developed a prototype platform (TSTEM), a containerized microservice architecture that uses Tweepy, Scrapy, Terraform, ELK, Kafka, and MLOps to autonomously search, extract, and index indicators of compromise (IOCs) in the wild. Moreover, the provisioning, monitoring, and management of the TSTEM platform are achieved through infrastructure as code (IaC). Custom focused crawlers collect web content, which is then processed by a first-level classifier to identify potential IOCs. If deemed relevant, the content advances to a second level of extraction for further examination. Throughout this process, state-of-the-art NLP models are used for classification and entity extraction, enhancing the overall IOC extraction methodology. Our experimental results indicate that these models exhibit high accuracy (exceeding 98%) in the classification and extraction tasks, achieving this performance in less than a minute. The effectiveness of our system can be attributed to a finely tuned IOC extraction method that operates at multiple stages, ensuring precise identification of relevant information with low false positives.
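The two-stage flow described in the abstract (a first-level relevance classifier followed by second-level IOC extraction) can be illustrated with a minimal Python sketch. This is not the TSTEM implementation: the abstract states that NLP models perform both steps, whereas this sketch substitutes a keyword check and regular expressions so that it stays self-contained. The names `RELEVANCE_KEYWORDS`, `IOC_PATTERNS`, `is_relevant`, `extract_iocs`, and `process` are hypothetical and chosen only for illustration.

```python
import re

# ASSUMPTION: keyword matching stands in for the first-level NLP classifier.
RELEVANCE_KEYWORDS = {"malware", "phishing", "c2", "ransomware", "exploit", "botnet"}

# ASSUMPTION: regexes stand in for the second-level NLP entity extraction.
IOC_PATTERNS = {
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
}


def is_relevant(text):
    """Stage 1: decide whether the content is worth passing to extraction."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return bool(tokens & RELEVANCE_KEYWORDS)


def extract_iocs(text):
    """Stage 2: pull candidate IOCs out of content that passed stage 1."""
    found = {}
    for name, pattern in IOC_PATTERNS.items():
        matches = sorted(set(pattern.findall(text)))
        if matches:
            found[name] = matches
    return found


def process(text):
    """Run both stages; return None when stage 1 rejects the content."""
    if not is_relevant(text):
        return None
    return extract_iocs(text)
```

Gating extraction behind a cheap relevance check is one plausible way to limit false positives: text that merely contains an IP-like string but no threat-related context never reaches the extractor.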
