Malicious and Benign URL Dataset Generation Using Character-Level LSTM Models

2022 IEEE Conference on Dependable and Secure Computing (DSC). Pub Date: 2022-06-22. DOI: 10.1109/DSC54232.2022.9888835

Spencer Vecile, Kyle Lacroix, Katarina Grolinger, J. Samarabandu

Citations: 1

Abstract

As technologies advance, so do the attacks on them. Cybersecurity plays a significant role in protecting society. Malicious URLs are links designed to promote scams, attacks, and fraud. Companies often have web filtering algorithms that blacklist specific URLs as malicious; however, due to privacy concerns, they do not give outside entities access to their cybersecurity data. This leaves cybersecurity research and machine learning applications in dire need of more data. This paper proposes using machine learning to generate new synthetic URLs that are characteristically indistinguishable from the data they replace. To do this, two character-level long short-term memory (LSTM) models were trained: one to generate malicious URLs and one to generate benign URLs. To assess the quality of the synthetic data, two tests were performed: (1) classifying the URLs as malicious or benign to ensure that the characteristics of the original data were preserved, and (2) using the Levenshtein ratio to check the similarity between the real and synthetic URLs to ensure sufficient anonymization. The results of the classification test show that the synthetic data classifier only slightly underperformed the real data classifier; however, with accuracy, precision, recall, sensitivity, and specificity all above 99%, it can be concluded that the characteristics of the malicious and benign URLs were preserved. The Levenshtein ratio tests showed mean similarities of 67% and 79% for the benign and malicious URLs, respectively. In the end, the character-level LSTM models successfully generated an anonymized synthetic dataset that was characteristically similar to the original, which could pave the way for publishing many more datasets by the same method.
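
To make the anonymization test more concrete, the following is a minimal Python sketch of a Levenshtein-based similarity score between a real and a synthetic URL. It is not the authors' implementation. The abstract does not state which normalization was used, so the sketch assumes one common choice: one minus the edit distance divided by the length of the longer string. Libraries sometimes define the ratio differently, so the numbers it produces are not directly comparable to the 67% and 79% reported above. The function names and the `mean_best_match` aggregation are hypothetical and serve only as illustration.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string so each row stays short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


def mean_best_match(synthetic: list[str], real: list[str]) -> float:
    """Hypothetical aggregation: average, over the synthetic URLs, of each
    one's highest similarity to any real URL. Both lists must be non-empty."""
    best = [max(similarity_ratio(s, r) for r in real) for s in synthetic]
    return sum(best) / len(best)
```

Under this assumed definition, a score near 1.0 would mean a synthetic URL is almost a copy of a real one (weak anonymization), while lower scores indicate that the generated string differs more from every real URL it is compared against.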
