TransURL：利用多层变换器编码和多尺度金字塔特征改进恶意 URL 检测

IF 4.4 2区计算机科学 Q1 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Computer Networks Pub Date : 2024-08-13 DOI:10.1016/j.comnet.2024.110707

{"title":"TransURL：利用多层变换器编码和多尺度金字塔特征改进恶意 URL 检测","authors":"","doi":"10.1016/j.comnet.2024.110707","DOIUrl":null,"url":null,"abstract":"<div><p>While machine learning progress is advancing the detection of malicious URLs, advanced Transformers applied to URLs face difficulties in extracting local information, character-level information, and structural relationships. To address these challenges, we propose a novel approach for malicious URL detection, named TransURL, that is implemented by co-training the character-aware Transformer with three feature modules—Multi-Layer Encoding, Multi-Scale Feature Learning, and Spatial Pyramid Attention. This special Transformer allows TransURL to extract embeddings that contain character-level information from URL token sequences, with three feature modules contributing to the fusion of multi-layer Transformer encodings and the capture of multi-scale local details and structural relationships. The proposed method is evaluated across several challenging scenarios, including class imbalance learning, multi-classification, cross-dataset testing, and adversarial sample attacks. The experimental results demonstrate a significant improvement compared to the best previous methods. For instance, it achieved a peak F1-score improvement of 40% in class-imbalanced scenarios, and exceeded the best baseline result by 14.13% in accuracy in adversarial attack scenarios. Additionally, we conduct a case study where our method accurately identifies all 30 active malicious web pages, whereas two pior SOTA methods miss 4 and 7 malicious web pages respectively. The codes and data are available at: <span><span>https://github.com/Vul-det/TransURL/</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50637,"journal":{"name":"Computer Networks","volume":null,"pages":null},"PeriodicalIF":4.4000,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"TransURL: Improving malicious URL detection with multi-layer Transformer encoding and multi-scale pyramid features\",\"authors\":\"\",\"doi\":\"10.1016/j.comnet.2024.110707\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>While machine learning progress is advancing the detection of malicious URLs, advanced Transformers applied to URLs face difficulties in extracting local information, character-level information, and structural relationships. To address these challenges, we propose a novel approach for malicious URL detection, named TransURL, that is implemented by co-training the character-aware Transformer with three feature modules—Multi-Layer Encoding, Multi-Scale Feature Learning, and Spatial Pyramid Attention. This special Transformer allows TransURL to extract embeddings that contain character-level information from URL token sequences, with three feature modules contributing to the fusion of multi-layer Transformer encodings and the capture of multi-scale local details and structural relationships. The proposed method is evaluated across several challenging scenarios, including class imbalance learning, multi-classification, cross-dataset testing, and adversarial sample attacks. The experimental results demonstrate a significant improvement compared to the best previous methods. For instance, it achieved a peak F1-score improvement of 40% in class-imbalanced scenarios, and exceeded the best baseline result by 14.13% in accuracy in adversarial attack scenarios. Additionally, we conduct a case study where our method accurately identifies all 30 active malicious web pages, whereas two pior SOTA methods miss 4 and 7 malicious web pages respectively. The codes and data are available at: <span><span>https://github.com/Vul-det/TransURL/</span><svg><path></path></svg></span>.</p></div>\",\"PeriodicalId\":50637,\"journal\":{\"name\":\"Computer Networks\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":4.4000,\"publicationDate\":\"2024-08-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Networks\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1389128624005395\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1389128624005395","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

摘要

虽然机器学习的进步推动了恶意 URL 的检测，但应用于 URL 的高级变换器在提取本地信息、字符级信息和结构关系方面面临困难。为了应对这些挑战，我们提出了一种新颖的恶意 URL 检测方法，命名为 TransURL，它是通过将字符感知变换器与三个特征模块（多层编码、多尺度特征学习和空间金字塔注意）共同训练而实现的。这种特殊的变换器允许 TransURL 从 URL 标记序列中提取包含字符级信息的嵌入，三个特征模块有助于融合多层变换器编码和捕捉多尺度局部细节和结构关系。我们在多个具有挑战性的场景中对所提出的方法进行了评估，包括类不平衡学习、多分类、跨数据集测试和对抗性样本攻击。实验结果表明，与之前的最佳方法相比，该方法有了显著改进。例如，在类不平衡场景中，它的 F1 分数峰值提高了 40%，在对抗性攻击场景中，它的准确率比最佳基线结果高出 14.13%。此外，我们还进行了一项案例研究，在这项研究中，我们的方法准确识别了所有 30 个活跃的恶意网页，而之前的两种 SOTA 方法分别漏掉了 4 个和 7 个恶意网页。代码和数据可在以下网址获取：https://github.com/Vul-det/TransURL/。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

TransURL: Improving malicious URL detection with multi-layer Transformer encoding and multi-scale pyramid features

While machine learning progress is advancing the detection of malicious URLs, advanced Transformers applied to URLs face difficulties in extracting local information, character-level information, and structural relationships. To address these challenges, we propose a novel approach for malicious URL detection, named TransURL, that is implemented by co-training the character-aware Transformer with three feature modules—Multi-Layer Encoding, Multi-Scale Feature Learning, and Spatial Pyramid Attention. This special Transformer allows TransURL to extract embeddings that contain character-level information from URL token sequences, with three feature modules contributing to the fusion of multi-layer Transformer encodings and the capture of multi-scale local details and structural relationships. The proposed method is evaluated across several challenging scenarios, including class imbalance learning, multi-classification, cross-dataset testing, and adversarial sample attacks. The experimental results demonstrate a significant improvement compared to the best previous methods. For instance, it achieved a peak F1-score improvement of 40% in class-imbalanced scenarios, and exceeded the best baseline result by 14.13% in accuracy in adversarial attack scenarios. Additionally, we conduct a case study where our method accurately identifies all 30 active malicious web pages, whereas two pior SOTA methods miss 4 and 7 malicious web pages respectively. The codes and data are available at: https://github.com/Vul-det/TransURL/.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Computer Networks 工程技术-电信学

CiteScore

10.80

自引率

3.60%

发文量

434

审稿时长

8.6 months

期刊介绍： Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers and operators of networks as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.