DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification

Abdelkader El Mahdaouy, Salima Lamsiyah, Meryem Janati Idrissi, Hamza Alami, Zakaria Yartaoui, Ismail Berrada
{"title":"DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification","authors":"Abdelkader El Mahdaouy, Salima Lamsiyah, Meryem Janati Idrissi, Hamza Alami, Zakaria Yartaoui, Ismail Berrada","doi":"arxiv-2409.09143","DOIUrl":null,"url":null,"abstract":"Detecting and classifying suspicious or malicious domain names and URLs is\nfundamental task in cybersecurity. To leverage such indicators of compromise,\ncybersecurity vendors and practitioners often maintain and update blacklists of\nknown malicious domains and URLs. However, blacklists frequently fail to\nidentify emerging and obfuscated threats. Over the past few decades, there has\nbeen significant interest in developing machine learning models that\nautomatically detect malicious domains and URLs, addressing the limitations of\nblacklists maintenance and updates. In this paper, we introduce DomURLs_BERT, a\npre-trained BERT-based encoder adapted for detecting and classifying\nsuspicious/malicious domains and URLs. DomURLs_BERT is pre-trained using the\nMasked Language Modeling (MLM) objective on a large multilingual corpus of\nURLs, domain names, and Domain Generation Algorithms (DGA) dataset. In order to\nassess the performance of DomURLs_BERT, we have conducted experiments on\nseveral binary and multi-class classification tasks involving domain names and\nURLs, covering phishing, malware, DGA, and DNS tunneling. The evaluations\nresults show that the proposed encoder outperforms state-of-the-art\ncharacter-based deep learning models and cybersecurity-focused BERT models\nacross multiple tasks and datasets. The pre-training dataset, the pre-trained\nDomURLs_BERT encoder, and the experiments source code are publicly available.","PeriodicalId":501332,"journal":{"name":"arXiv - CS - Cryptography and Security","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Cryptography and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09143","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Detecting and classifying suspicious or malicious domain names and URLs is a fundamental task in cybersecurity. To leverage such indicators of compromise, cybersecurity vendors and practitioners often maintain and update blacklists of known malicious domains and URLs. However, blacklists frequently fail to identify emerging and obfuscated threats. Over the past few decades, there has been significant interest in developing machine learning models that automatically detect malicious domains and URLs, addressing the limitations of blacklist maintenance and updates. In this paper, we introduce DomURLs_BERT, a pre-trained BERT-based encoder adapted for detecting and classifying suspicious/malicious domains and URLs. DomURLs_BERT is pre-trained using the Masked Language Modeling (MLM) objective on a large multilingual corpus of URLs, domain names, and a Domain Generation Algorithms (DGA) dataset. To assess the performance of DomURLs_BERT, we conducted experiments on several binary and multi-class classification tasks involving domain names and URLs, covering phishing, malware, DGA, and DNS tunneling. The evaluation results show that the proposed encoder outperforms state-of-the-art character-based deep learning models and cybersecurity-focused BERT models across multiple tasks and datasets. The pre-training dataset, the pre-trained DomURLs_BERT encoder, and the source code for the experiments are publicly available.
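The abstract states that the pre-trained encoder is fine-tuned for binary and multi-class classification over domains and URLs. As a minimal sketch of how such a fine-tuned checkpoint could be used for inference, assuming the released encoder is distributed in Hugging Face transformers format (the model ID, label names, and input URL below are hypothetical placeholders, not the authors' actual artifacts):

```python
# Sketch: scoring a single URL with a fine-tuned DomURLs_BERT-style checkpoint.
# MODEL_ID is a hypothetical placeholder; substitute the authors' released model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "your-org/DomURLs_BERT-phishing"  # hypothetical checkpoint ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def classify(url: str) -> dict:
    """Return class-name -> probability for a single domain or URL string."""
    inputs = tokenizer(url, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    # id2label is populated from the fine-tuned checkpoint's config
    return {model.config.id2label[i]: p.item() for i, p in enumerate(probs)}

print(classify("http://secure-login.example-bank.top/verify"))
```

The same pattern extends to the multi-class settings described in the abstract (phishing, malware, DGA, DNS tunneling); only the checkpoint and its label mapping change.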