Identifying and Categorizing Malicious Content on Paste Sites: A Neural Topic Modeling Approach

Tala Vahedi, Benjamin Ampel, S. Samtani, Hsinchun Chen
2021 IEEE International Conference on Intelligence and Security Informatics (ISI), published November 2, 2021. DOI: 10.1109/ISI53945.2021.9624765
Citations: 1

Abstract

Malicious cyber activities impose substantial costs on the U.S. economy and global markets. Cyber-criminals often use information-sharing social media platforms such as paste sites (e.g., Pastebin) to share vast amounts of plain text content related to Personally Identifiable Information (PII), credit card numbers, exploit code, malware, and other sensitive content. Paste sites can provide targeted Cyber Threat Intelligence (CTI) about potential threats and prior breaches. In this research, we propose a novel Bidirectional Encoder Representations from Transformers (BERT) with Latent Dirichlet Allocation (LDA) model to categorize pastes automatically. Our proposed BERT-LDA model leverages a neural network transformer architecture to capture sequential dependencies when representing each sentence in a paste. BERT-LDA replaces the Bag-of-Words (BoW) approach in the conventional LDA with a Bag-of-Labels (BoL) that encompasses class labels at the sequence level. We compared the performance of the proposed BERT-LDA against the conventional LDA and BERT-LDA variants (e.g., GPT2-LDA) on 4,254,453 pastes from three paste sites. Experimental results indicate that the proposed BERT-LDA outperformed the standard LDA and each BERT-LDA variant in terms of perplexity on each paste site. Results of our BERT-LDA case study suggest that significant content relating to hacker community activities, malicious code, network and website vulnerabilities, and PII are shared on paste sites. The insights provided by this study could be used by organizations to proactively mitigate potential damage on their infrastructure.
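The core idea of the pipeline above — map each sentence in a paste to a class label with a transformer, then run LDA over the resulting Bag-of-Labels instead of a Bag-of-Words — can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the keyword-based `sentence_label` stub and its label set are hypothetical stand-ins for the paper's fine-tuned BERT classifier, and the tiny collapsed-Gibbs LDA is a generic textbook implementation.

```python
import random

def sentence_label(sentence):
    # Hypothetical stand-in for the BERT sentence classifier: in the actual
    # pipeline a transformer assigns a class label to each sentence; here a
    # trivial keyword rule plays that role purely for illustration.
    keywords = {"password": "PII", "card": "PII",
                "exploit": "malicious_code", "payload": "malicious_code",
                "forum": "hacker_community"}
    for kw, label in keywords.items():
        if kw in sentence.lower():
            return label
    return "other"

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.1, iters=200, seed=0):
    """Collapsed Gibbs sampling LDA where each 'word' is a sentence label,
    i.e. LDA over Bag-of-Labels documents. Returns (vocab, topic-label counts)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    vid = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [[0] * V for _ in range(n_topics)]           # topic-label counts
    nk = [0] * n_topics                                # topic totals
    z = []                                             # topic assignments
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(n_topics)
            zs.append(t)
            ndk[d][t] += 1; nkw[t][vid[w]] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # remove current assignment, then resample from the
                # conditional p(topic | everything else)
                ndk[d][t] -= 1; nkw[t][vid[w]] -= 1; nk[t] -= 1
                weights = [(ndk[d][k] + alpha) * (nkw[k][vid[w]] + beta)
                           / (nk[k] + V * beta) for k in range(n_topics)]
                r = rng.random() * sum(weights)
                acc = 0.0
                for k, wt in enumerate(weights):
                    acc += wt
                    if r <= acc:
                        t = k
                        break
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][vid[w]] += 1; nk[t] += 1
    return vocab, nkw

# Each paste becomes a Bag-of-Labels document rather than a Bag-of-Words:
pastes = [["leaked card numbers attached", "admin password list"],
          ["new exploit for the login form", "payload drops a backdoor"]]
docs = [[sentence_label(s) for s in paste] for paste in pastes]
vocab, topic_counts = lda_gibbs(docs, n_topics=2, iters=100)
```

Because the label vocabulary is far smaller than a word vocabulary, the topic-label distributions are much more compact, which is consistent with the lower perplexity the abstract reports for BERT-LDA over word-level LDA.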