Malicious and Benign URL Dataset Generation Using Character-Level LSTM Models

Spencer Vecile, Kyle Lacroix, Katarina Grolinger, J. Samarabandu
Published in: 2022 IEEE Conference on Dependable and Secure Computing (DSC)
Publication date: 2022-06-22
DOI: 10.1109/DSC54232.2022.9888835 (https://doi.org/10.1109/DSC54232.2022.9888835)
Citations: 1

Abstract

As technologies advance, so do the attacks on them, and cybersecurity plays a significant role in protecting society. Malicious URLs are links designed to promote scams, attacks, and fraud. Companies often run web filtering algorithms that blacklist specific URLs as malicious; however, due to privacy concerns, they do not give outside entities access to their cybersecurity data. This lack of shared data leaves cybersecurity research and machine learning applications in dire need of more data. This paper proposes using machine learning to generate new synthetic URLs that are characteristically indistinguishable from the data they replace. To do this, two character-level long short-term memory (LSTM) models were trained: one to generate malicious URLs and one to generate benign URLs. Two tests were performed to assess the quality of the synthetic data: (1) classifying the URLs as malicious or benign, to ensure the characteristics of the original data were preserved; and (2) using the Levenshtein ratio to measure the similarity between the real and synthetic URLs, to ensure sufficient anonymization. The classification test shows that the synthetic data classifier only slightly underperformed the real data classifier; with accuracy, precision, recall, sensitivity, and specificity all above 99%, it can be concluded that the characteristics of the malicious and benign URLs were preserved. The Levenshtein ratio tests showed mean similarities of 67% and 79% for the benign and malicious URLs, respectively. In the end, the character-level LSTM models successfully generated an anonymized synthetic dataset that was characteristically similar to the original, which could pave the way for the publication of many more datasets in this way.
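
The two evaluation tests named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the similarity ratio is normalized, so the sketch assumes one common convention, 1 - distance / max(length), with unit-cost edits; other libraries normalize differently and would give different percentages. The function names, the example URLs, and the confusion-matrix counts are all hypothetical. Note that recall and sensitivity are the same quantity, so the sketch reports it once.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / max(len); 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Classification metrics from confusion-matrix counts (malicious = positive)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # identical to sensitivity
        "specificity": tn / (tn + fp),
    }


# Hypothetical inputs, for illustration only (not data from the paper).
real_url = "http://example.com/login"
synthetic_url = "http://example.net/logon"
print(similarity_ratio(real_url, synthetic_url))
print(binary_metrics(tp=90, fp=10, tn=80, fn=20))
```

Under this convention, a lower mean ratio between real and synthetic URLs indicates stronger anonymization, while high classifier metrics indicate that the malicious and benign characteristics survived generation.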