From past to present: A survey of malicious URL detection techniques, datasets and code repositories

IF 12.7 · Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS) · CAS Tier 1 (Computer Science)
Ye Tian, Yanqiu Yu, Jianguo Sun, Yanbin Wang
{"title":"From past to present: A survey of malicious URL detection techniques, datasets and code repositories","authors":"Ye Tian ,&nbsp;Yanqiu Yu ,&nbsp;Jianguo Sun ,&nbsp;Yanbin Wang","doi":"10.1016/j.cosrev.2025.100810","DOIUrl":null,"url":null,"abstract":"<div><div>Malicious URLs persistently threaten the cybersecurity ecosystem, by either deceiving users into divulging private data or distributing harmful payloads to infiltrate host systems. The detection of malicious URLs is a protracted arms race between defenders and attackers. Gaining timely insights into the current state of this ongoing battle holds significant importance. However, existing reviews suffer from four critical limitations: 1) Their reliance on algorithm-centric taxonomies obscures understanding of how detection approaches exploit specific modal information channels; 2) They fail to incorporate pivotal LLM/Transformer-based defenses; 3) No open-source implementations are collected to facilitate benchmarking; 4) Insufficient dataset coverage.</div><div>This paper presents a comprehensive review of malicious URL detection technologies, systematically analyzing methods from traditional blacklisting to advanced deep learning approaches (e.g., Transformer, GNNs, and LLMs). Unlike prior surveys, we propose a novel modality-based taxonomy that categorizes existing works according to their primary data modalities (e.g., lexical URL features, HTML structure, JavaScript behavior, visual layout).For instance, we group models that parse DOM trees and extract HTML tag paths into the HTML modality, while those using rendered webpage screenshots are classified under the visual modality. This taxonomy reveals how distinct input channels inform model design, offering new perspectives that are obscured in algorithm-based classifications. This hierarchical classification enables both rigorous technical analysis and clear understanding of multimodal information utilization. Furthermore, to establish a profile of accessible datasets and address the lack of standardized benchmarking (where current studies often lack proper baseline comparisons), we curate and analyze two key resources: 1) publicly available datasets (2016–2024), and 2) open-source implementations from published works (2013–2025). To facilitate cross-method comparison and support future benchmarking efforts, we compile a comparative table summarizing key performance metrics (e.g., Accuracy, F1 Score, AUC) as reported in the original works or open-source repositories. While no formal evaluation protocol is proposed, this effort provides a practical reference point and highlights the need for more standardized benchmarking practices in malicious URL detection research. The review concludes by examining emerging challenges and proposing actionable directions for future research. 
We maintain a GitHub repository for ongoing curation of datasets and open-source implementations: <span><span>https://github.com/sevenolu7/Malicious-URL-Detection-Open-Source/tree/master</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":48633,"journal":{"name":"Computer Science Review","volume":"58 ","pages":"Article 100810"},"PeriodicalIF":12.7000,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Science Review","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1574013725000863","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Malicious URLs persistently threaten the cybersecurity ecosystem by either deceiving users into divulging private data or distributing harmful payloads that infiltrate host systems. The detection of malicious URLs is a protracted arms race between defenders and attackers, and gaining timely insight into the current state of this ongoing battle is of significant importance. However, existing reviews suffer from four critical limitations: 1) their reliance on algorithm-centric taxonomies obscures how detection approaches exploit specific modal information channels; 2) they fail to incorporate pivotal LLM/Transformer-based defenses; 3) they collect no open-source implementations to facilitate benchmarking; and 4) their dataset coverage is insufficient.
This paper presents a comprehensive review of malicious URL detection technologies, systematically analyzing methods from traditional blacklisting to advanced deep learning approaches (e.g., Transformers, GNNs, and LLMs). Unlike prior surveys, we propose a novel modality-based taxonomy that categorizes existing works according to their primary data modalities (e.g., lexical URL features, HTML structure, JavaScript behavior, visual layout). For instance, we group models that parse DOM trees and extract HTML tag paths into the HTML modality, while those using rendered webpage screenshots are classified under the visual modality. This taxonomy reveals how distinct input channels inform model design, offering perspectives that are obscured in algorithm-based classifications, and enables both rigorous technical analysis and a clear understanding of how multimodal information is utilized. Furthermore, to establish a profile of accessible datasets and address the lack of standardized benchmarking (current studies often lack proper baseline comparisons), we curate and analyze two key resources: 1) publicly available datasets (2016–2024), and 2) open-source implementations from published works (2013–2025). To facilitate cross-method comparison and support future benchmarking efforts, we compile a comparative table summarizing key performance metrics (e.g., Accuracy, F1 Score, AUC) as reported in the original works or open-source repositories. While no formal evaluation protocol is proposed, this effort provides a practical reference point and highlights the need for more standardized benchmarking practices in malicious URL detection research. The review concludes by examining emerging challenges and proposing actionable directions for future research. We maintain a GitHub repository for ongoing curation of datasets and open-source implementations: https://github.com/sevenolu7/Malicious-URL-Detection-Open-Source/tree/master.
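To make the modality distinction concrete, the following minimal Python sketch (not drawn from any of the surveyed works; all function names and features are illustrative assumptions) shows how the same webpage can feed two separate channels of the taxonomy: lexical statistics computed from the URL string alone, and root-to-tag DOM paths extracted from the HTML structure.

# Minimal illustrative sketch (assumed, not from the surveyed works): the same page
# feeds two of the survey's modalities -- lexical URL features and HTML tag paths.
from urllib.parse import urlparse
from html.parser import HTMLParser


def lexical_url_features(url: str) -> dict:
    """URL-string (lexical) modality: simple character/token statistics."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "num_hyphens": url.count("-"),
        "num_digits": sum(c.isdigit() for c in url),
        "path_depth": parsed.path.count("/"),
        "host_is_ip": host.replace(".", "").isdigit() and host.count(".") == 3,
        "uses_https": parsed.scheme == "https",
    }


class TagPathExtractor(HTMLParser):
    """HTML-structure modality: collect root-to-tag paths from the parsed DOM."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.tag_paths = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.tag_paths.append("/".join(self.stack))

    def handle_endtag(self, tag):
        if tag in self.stack:  # pop back to (and including) the matching open tag
            del self.stack[self.stack.index(tag):]


url = "http://login-secure-update.example.com/verify/account.php?id=123"
print(lexical_url_features(url))

extractor = TagPathExtractor()
extractor.feed("<html><body><form><input type='password'></form></body></html>")
print(extractor.tag_paths)  # ['html', 'html/body', 'html/body/form', 'html/body/form/input']

In the survey's taxonomy, a detector that consumes only the lexical feature dictionary would fall under the URL modality, while one that also consumes the tag paths additionally draws on the HTML modality.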
Source Journal

Computer Science Review (General Computer Science)
CiteScore: 32.70
Self-citation rate: 0.00%
Articles published: 26
Review time: 51 days

Journal description: Computer Science Review, a publication dedicated to research surveys and expository overviews of open problems in computer science, targets a broad audience within the field seeking comprehensive insights into the latest developments. The journal welcomes articles from various fields as long as their content impacts the advancement of computer science. In particular, articles that review the application of well-known Computer Science methods to other areas are in scope only if these articles advance the fundamental understanding of those methods.