Identifying and Categorizing Malicious Content on Paste Sites: A Neural Topic Modeling Approach

Tala Vahedi, Benjamin Ampel, S. Samtani, Hsinchun Chen
2021 IEEE International Conference on Intelligence and Security Informatics (ISI), published November 2, 2021. DOI: 10.1109/ISI53945.2021.9624765
Citations: 1

Abstract

Malicious cyber activities impose substantial costs on the U.S. economy and global markets. Cyber-criminals often use information-sharing social media platforms such as paste sites (e.g., Pastebin) to share vast amounts of plain text content related to Personally Identifiable Information (PII), credit card numbers, exploit code, malware, and other sensitive content. Paste sites can provide targeted Cyber Threat Intelligence (CTI) about potential threats and prior breaches. In this research, we propose a novel Bidirectional Encoder Representations from Transformers (BERT) with Latent Dirichlet Allocation (LDA) model to categorize pastes automatically. Our proposed BERT-LDA model leverages a neural network transformer architecture to capture sequential dependencies when representing each sentence in a paste. BERT-LDA replaces the Bag-of-Words (BoW) approach in the conventional LDA with a Bag-of-Labels (BoL) that encompasses class labels at the sequence level. We compared the performance of the proposed BERT-LDA against the conventional LDA and BERT-LDA variants (e.g., GPT2-LDA) on 4,254,453 pastes from three paste sites. Experimental results indicate that the proposed BERT-LDA outperformed the standard LDA and each BERT-LDA variant in terms of perplexity on each paste site. Results of our BERT-LDA case study suggest that significant content relating to hacker community activities, malicious code, network and website vulnerabilities, and PII are shared on paste sites. The insights provided by this study could be used by organizations to proactively mitigate potential damage on their infrastructure.
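The core idea of the pipeline above — map each sentence in a paste to a class label with a transformer, then run LDA over the resulting Bag-of-Labels instead of a Bag-of-Words — can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the keyword-based `sentence_label` stub and its label set are hypothetical stand-ins for the paper's fine-tuned BERT classifier, and the tiny collapsed-Gibbs LDA is a generic textbook implementation.

```python
import random

def sentence_label(sentence):
    # Hypothetical stand-in for the BERT sentence classifier: in the actual
    # pipeline a transformer assigns a class label to each sentence; here a
    # trivial keyword rule plays that role purely for illustration.
    keywords = {"password": "PII", "card": "PII",
                "exploit": "malicious_code", "payload": "malicious_code",
                "forum": "hacker_community"}
    for kw, label in keywords.items():
        if kw in sentence.lower():
            return label
    return "other"

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.1, iters=200, seed=0):
    """Collapsed Gibbs sampling LDA where each 'word' is a sentence label,
    i.e. LDA over Bag-of-Labels documents. Returns (vocab, topic-label counts)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    vid = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [[0] * V for _ in range(n_topics)]           # topic-label counts
    nk = [0] * n_topics                                # topic totals
    z = []                                             # topic assignments
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            t = rng.randrange(n_topics)
            zs.append(t)
            ndk[d][t] += 1; nkw[t][vid[w]] += 1; nk[t] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # remove current assignment, then resample from the
                # conditional p(topic | everything else)
                ndk[d][t] -= 1; nkw[t][vid[w]] -= 1; nk[t] -= 1
                weights = [(ndk[d][k] + alpha) * (nkw[k][vid[w]] + beta)
                           / (nk[k] + V * beta) for k in range(n_topics)]
                r = rng.random() * sum(weights)
                acc = 0.0
                for k, wt in enumerate(weights):
                    acc += wt
                    if r <= acc:
                        t = k
                        break
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][vid[w]] += 1; nk[t] += 1
    return vocab, nkw

# Each paste becomes a Bag-of-Labels document rather than a Bag-of-Words:
pastes = [["leaked card numbers attached", "admin password list"],
          ["new exploit for the login form", "payload drops a backdoor"]]
docs = [[sentence_label(s) for s in paste] for paste in pastes]
vocab, topic_counts = lda_gibbs(docs, n_topics=2, iters=100)
```

Because the label vocabulary is far smaller than a word vocabulary, the topic-label distributions are much more compact, which is consistent with the lower perplexity the abstract reports for BERT-LDA over word-level LDA.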