Continuous multi-task pre-training for malicious URL detection and webpage classification

IF 4.4 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Yujie Li, Yiwei Liu, Peiyue Li, Yifan Jia, Yanbin Wang
{"title":"用于恶意URL检测和网页分类的连续多任务预训练","authors":"Yujie Li ,&nbsp;Yiwei Liu ,&nbsp;Peiyue Li ,&nbsp;Yifan Jia ,&nbsp;Yanbin Wang","doi":"10.1016/j.comnet.2025.111513","DOIUrl":null,"url":null,"abstract":"<div><div>Malicious URL detection and webpage classification are critical tasks in cybersecurity and information management. In recent years, extensive research has explored using BERT or similar language models to replace traditional machine learning methods for detecting malicious URLs and classifying webpages. While previous studies show promising results, they often apply existing language models to these tasks without accounting for the inherent differences in domain data (e.g., URLs being loosely structured and semantically sparse compared to text), leaving room for performance improvement. Furthermore, current approaches focus on single tasks and have not been tested in multi-task scenarios.</div><div>To address these challenges, we propose <span>urlBERT</span>, a pre-trained URL encoder leveraging Transformer to encode foundational knowledge from billions of unlabeled URLs. To achieve it, we propose to use 5 unsupervised pretraining tasks to capture multi-level information of URL lexical, syntax, and semantics, and generate contrastive and adversarial representations. Furthermore, to avoid inter-pre-training competition and interference, we proposed a grouped sequential learning method to ensure effective training across multi-tasks. Finally, we leverage a two-stage fine-tuning approach to improve the training stability and efficiency of the task model. To assess the multitasking potential of <span>urlBERT</span>, we fine-tune the task model in both single-task and multi-task modes. The former creates a classification model for a single task, while the latter builds a classification model capable of handling multiple tasks. We evaluate URLBERT on three downstream tasks: phishing URL detection, advertising URL detection, and webpage classification. The results demonstrate that <span>urlBERT</span> outperforms standard pre-trained models, and its multi-task mode is capable of addressing the real-world demands of multitasking. The code is available at <span><span>https://github.com/Davidup1/URLBERT</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50637,"journal":{"name":"Computer Networks","volume":"270 ","pages":"Article 111513"},"PeriodicalIF":4.4000,"publicationDate":"2025-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Continuous multi-task pre-training for malicious URL detection and webpage classification\",\"authors\":\"Yujie Li ,&nbsp;Yiwei Liu ,&nbsp;Peiyue Li ,&nbsp;Yifan Jia ,&nbsp;Yanbin Wang\",\"doi\":\"10.1016/j.comnet.2025.111513\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Malicious URL detection and webpage classification are critical tasks in cybersecurity and information management. In recent years, extensive research has explored using BERT or similar language models to replace traditional machine learning methods for detecting malicious URLs and classifying webpages. While previous studies show promising results, they often apply existing language models to these tasks without accounting for the inherent differences in domain data (e.g., URLs being loosely structured and semantically sparse compared to text), leaving room for performance improvement. 
Furthermore, current approaches focus on single tasks and have not been tested in multi-task scenarios.</div><div>To address these challenges, we propose <span>urlBERT</span>, a pre-trained URL encoder leveraging Transformer to encode foundational knowledge from billions of unlabeled URLs. To achieve it, we propose to use 5 unsupervised pretraining tasks to capture multi-level information of URL lexical, syntax, and semantics, and generate contrastive and adversarial representations. Furthermore, to avoid inter-pre-training competition and interference, we proposed a grouped sequential learning method to ensure effective training across multi-tasks. Finally, we leverage a two-stage fine-tuning approach to improve the training stability and efficiency of the task model. To assess the multitasking potential of <span>urlBERT</span>, we fine-tune the task model in both single-task and multi-task modes. The former creates a classification model for a single task, while the latter builds a classification model capable of handling multiple tasks. We evaluate URLBERT on three downstream tasks: phishing URL detection, advertising URL detection, and webpage classification. The results demonstrate that <span>urlBERT</span> outperforms standard pre-trained models, and its multi-task mode is capable of addressing the real-world demands of multitasking. The code is available at <span><span>https://github.com/Davidup1/URLBERT</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":50637,\"journal\":{\"name\":\"Computer Networks\",\"volume\":\"270 \",\"pages\":\"Article 111513\"},\"PeriodicalIF\":4.4000,\"publicationDate\":\"2025-07-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1389128625004803\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1389128625004803","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Malicious URL detection and webpage classification are critical tasks in cybersecurity and information management. In recent years, extensive research has explored using BERT or similar language models to replace traditional machine learning methods for detecting malicious URLs and classifying webpages. While previous studies show promising results, they often apply existing language models to these tasks without accounting for the inherent differences in domain data (e.g., URLs being loosely structured and semantically sparse compared to text), leaving room for performance improvement. Furthermore, current approaches focus on single tasks and have not been tested in multi-task scenarios.
To address these challenges, we propose URLBERT, a pre-trained URL encoder that leverages the Transformer architecture to encode foundational knowledge from billions of unlabeled URLs. To this end, we use five unsupervised pre-training tasks to capture multi-level lexical, syntactic, and semantic information from URLs and to generate contrastive and adversarial representations. Furthermore, to avoid competition and interference among the pre-training tasks, we propose a grouped sequential learning method that ensures effective training across tasks. Finally, we adopt a two-stage fine-tuning approach to improve the training stability and efficiency of the task model. To assess the multitasking potential of URLBERT, we fine-tune the task model in both single-task and multi-task modes: the former creates a classification model for a single task, while the latter builds a classification model capable of handling multiple tasks. We evaluate URLBERT on three downstream tasks: phishing URL detection, advertising URL detection, and webpage classification. The results demonstrate that URLBERT outperforms standard pre-trained models, and its multi-task mode is capable of addressing the real-world demands of multitasking. The code is available at https://github.com/Davidup1/URLBERT.
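As an illustration of the two-stage fine-tuning idea mentioned in the abstract, the minimal sketch below first trains only a classification head on top of a frozen BERT-style encoder, then unfreezes the encoder for end-to-end fine-tuning on phishing URL detection. The generic bert-base-uncased checkpoint, the toy URLs and labels, and the learning rates are stand-in assumptions for demonstration only; the authors' actual implementation is in the linked repository.

```python
# Minimal sketch of two-stage fine-tuning for phishing URL detection.
# Assumption: bert-base-uncased stands in for the URLBERT encoder; data,
# labels, and learning rates are toy values, not the authors' settings.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 0 = benign, 1 = phishing

urls = ["https://example.com/login",
        "http://paypal.example-verify.badsite.ru/account"]  # toy samples
labels = torch.tensor([0, 1])
batch = tokenizer(urls, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

# Stage 1: freeze the encoder and train only the classification head.
for p in model.bert.parameters():
    p.requires_grad = False
head_optim = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)
loss = model(**batch, labels=labels).loss
loss.backward()
head_optim.step()
head_optim.zero_grad()

# Stage 2: unfreeze the encoder and fine-tune end-to-end at a lower rate.
for p in model.bert.parameters():
    p.requires_grad = True
full_optim = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
full_optim.step()
full_optim.zero_grad()
```

In the multi-task mode described above, the same frozen-then-unfrozen schedule would apply to a shared encoder with one classification head per task; that variant is omitted from this sketch.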
Source journal: Computer Networks (Engineering & Technology – Telecommunications)
CiteScore: 10.80
Self-citation rate: 3.60%
Articles per year: 434
Review time: 8.6 months
Journal description: Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers and operators of networks as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.