DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification

Abdelkader El Mahdaouy, Salima Lamsiyah, Meryem Janati Idrissi, Hamza Alami, Zakaria Yartaoui, Ismail Berrada
{"title":"DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification","authors":"Abdelkader El Mahdaouy, Salima Lamsiyah, Meryem Janati Idrissi, Hamza Alami, Zakaria Yartaoui, Ismail Berrada","doi":"arxiv-2409.09143","DOIUrl":null,"url":null,"abstract":"Detecting and classifying suspicious or malicious domain names and URLs is\nfundamental task in cybersecurity. To leverage such indicators of compromise,\ncybersecurity vendors and practitioners often maintain and update blacklists of\nknown malicious domains and URLs. However, blacklists frequently fail to\nidentify emerging and obfuscated threats. Over the past few decades, there has\nbeen significant interest in developing machine learning models that\nautomatically detect malicious domains and URLs, addressing the limitations of\nblacklists maintenance and updates. In this paper, we introduce DomURLs_BERT, a\npre-trained BERT-based encoder adapted for detecting and classifying\nsuspicious/malicious domains and URLs. DomURLs_BERT is pre-trained using the\nMasked Language Modeling (MLM) objective on a large multilingual corpus of\nURLs, domain names, and Domain Generation Algorithms (DGA) dataset. In order to\nassess the performance of DomURLs_BERT, we have conducted experiments on\nseveral binary and multi-class classification tasks involving domain names and\nURLs, covering phishing, malware, DGA, and DNS tunneling. The evaluations\nresults show that the proposed encoder outperforms state-of-the-art\ncharacter-based deep learning models and cybersecurity-focused BERT models\nacross multiple tasks and datasets. The pre-training dataset, the pre-trained\nDomURLs_BERT encoder, and the experiments source code are publicly available.","PeriodicalId":501332,"journal":{"name":"arXiv - CS - Cryptography and Security","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Cryptography and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09143","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Detecting and classifying suspicious or malicious domain names and URLs is a fundamental task in cybersecurity. To leverage such indicators of compromise, cybersecurity vendors and practitioners often maintain and update blacklists of known malicious domains and URLs. However, blacklists frequently fail to identify emerging and obfuscated threats. Over the past few decades, there has been significant interest in developing machine learning models that automatically detect malicious domains and URLs, addressing the limitations of blacklist maintenance and updates. In this paper, we introduce DomURLs_BERT, a pre-trained BERT-based encoder adapted for detecting and classifying suspicious/malicious domains and URLs. DomURLs_BERT is pre-trained using the Masked Language Modeling (MLM) objective on a large multilingual corpus of URLs, domain names, and a Domain Generation Algorithms (DGA) dataset. To assess the performance of DomURLs_BERT, we conducted experiments on several binary and multi-class classification tasks involving domain names and URLs, covering phishing, malware, DGA, and DNS tunneling. The evaluation results show that the proposed encoder outperforms state-of-the-art character-based deep learning models and cybersecurity-focused BERT models across multiple tasks and datasets. The pre-training dataset, the pre-trained DomURLs_BERT encoder, and the source code for the experiments are publicly available.
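The abstract states that the pre-trained encoder is fine-tuned for binary and multi-class classification over domains and URLs. As a minimal sketch of how such a fine-tuned checkpoint could be used for inference, assuming the released encoder is distributed in Hugging Face transformers format (the model ID, label names, and input URL below are hypothetical placeholders, not the authors' actual artifacts):

```python
# Sketch: scoring a single URL with a fine-tuned DomURLs_BERT-style checkpoint.
# MODEL_ID is a hypothetical placeholder; substitute the authors' released model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "your-org/DomURLs_BERT-phishing"  # hypothetical checkpoint ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def classify(url: str) -> dict:
    """Return class-name -> probability for a single domain or URL string."""
    inputs = tokenizer(url, truncation=True, max_length=128, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    # id2label is populated from the fine-tuned checkpoint's config
    return {model.config.id2label[i]: p.item() for i, p in enumerate(probs)}

print(classify("http://secure-login.example-bank.top/verify"))
```

The same pattern extends to the multi-class settings described in the abstract (phishing, malware, DGA, DNS tunneling); only the checkpoint and its label mapping change.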