From past to present: A survey of malicious URL detection techniques, datasets and code repositories

Ye Tian, Yanqiu Yu, Jianguo Sun, Yanbin Wang

Computer Science Review, Volume 58, Article 100810. DOI: 10.1016/j.cosrev.2025.100810. Published 2025-08-26 (Journal Article). JCR Q1, Computer Science, Information Systems; Impact Factor 12.7. Available at https://www.sciencedirect.com/science/article/pii/S1574013725000863.
Malicious URLs persistently threaten the cybersecurity ecosystem by deceiving users into divulging private data or by distributing harmful payloads that infiltrate host systems. The detection of malicious URLs is a protracted arms race between defenders and attackers, so timely insight into the current state of this battle is essential. However, existing reviews suffer from four critical limitations: 1) their reliance on algorithm-centric taxonomies obscures how detection approaches exploit specific modal information channels; 2) they omit pivotal Transformer- and LLM-based defenses; 3) they collect no open-source implementations to facilitate benchmarking; and 4) their dataset coverage is insufficient.
This paper presents a comprehensive review of malicious URL detection technologies, systematically analyzing methods from traditional blacklisting to advanced deep learning approaches (e.g., Transformers, GNNs, and LLMs). Unlike prior surveys, we propose a novel modality-based taxonomy that categorizes existing works according to their primary data modalities (e.g., lexical URL features, HTML structure, JavaScript behavior, visual layout). For instance, models that parse DOM trees and extract HTML tag paths fall under the HTML modality, while those using rendered webpage screenshots fall under the visual modality. This taxonomy reveals how distinct input channels inform model design, offering perspectives that algorithm-based classifications obscure, and the resulting hierarchical classification enables both rigorous technical analysis and a clear understanding of multimodal information utilization.
Furthermore, to profile accessible datasets and address the lack of standardized benchmarking (current studies often lack proper baseline comparisons), we curate and analyze two key resources: 1) publicly available datasets (2016–2024), and 2) open-source implementations from published works (2013–2025). To facilitate cross-method comparison and support future benchmarking efforts, we compile a comparative table summarizing key performance metrics (e.g., Accuracy, F1 score, AUC) as reported in the original works or open-source repositories. While we propose no formal evaluation protocol, this effort provides a practical reference point and highlights the need for more standardized benchmarking practices in malicious URL detection research. The review concludes by examining emerging challenges and proposing actionable directions for future research. We maintain a GitHub repository for ongoing curation of datasets and open-source implementations: https://github.com/sevenolu7/Malicious-URL-Detection-Open-Source/tree/master.
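To make the "lexical URL features" modality concrete, the sketch below computes a handful of string-level features commonly used in this line of work, using only the URL itself (no page fetch or rendering). It is a minimal, hypothetical illustration: the feature set, the suspicious-token list, and the field names are our own choices for exposition, not definitions taken from the survey.

```python
import math
import re
from urllib.parse import urlparse

# Illustrative phishing-associated keywords; real systems learn or curate larger lists.
SUSPICIOUS_TOKENS = ("login", "verify", "update", "secure", "account")

def shannon_entropy(s: str) -> float:
    """Character-level Shannon entropy; high values can hint at algorithmically
    generated domain names."""
    if not s:
        return 0.0
    n = len(s)
    counts = {c: s.count(c) for c in set(s)}
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def lexical_features(url: str) -> dict:
    """Extract simple lexical features from a URL string alone."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_dots_in_host": host.count("."),
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "uses_https": parsed.scheme == "https",
        "suspicious_tokens": sum(tok in url.lower() for tok in SUSPICIOUS_TOKENS),
        "host_entropy": round(shannon_entropy(host), 3),
    }

print(lexical_features("http://192.168.0.1/secure-login/update.php"))
```

Feature vectors like this would then feed a conventional classifier; by contrast, methods in the HTML or visual modalities described above require fetching and parsing (or rendering) the page itself.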
Journal Introduction:
Computer Science Review, a publication dedicated to research surveys and expository overviews of open problems in computer science, targets a broad audience within the field seeking comprehensive insights into the latest developments. The journal welcomes articles from various fields as long as their content impacts the advancement of computer science. In particular, articles that review the application of well-known Computer Science methods to other areas are in scope only if these articles advance the fundamental understanding of those methods.