{"title":"TransURL: Improving malicious URL detection with multi-layer Transformer encoding and multi-scale pyramid features","authors":"","doi":"10.1016/j.comnet.2024.110707","DOIUrl":null,"url":null,"abstract":"<div><p>While machine learning progress is advancing the detection of malicious URLs, advanced Transformers applied to URLs face difficulties in extracting local information, character-level information, and structural relationships. To address these challenges, we propose a novel approach for malicious URL detection, named TransURL, that is implemented by co-training the character-aware Transformer with three feature modules—Multi-Layer Encoding, Multi-Scale Feature Learning, and Spatial Pyramid Attention. This special Transformer allows TransURL to extract embeddings that contain character-level information from URL token sequences, with three feature modules contributing to the fusion of multi-layer Transformer encodings and the capture of multi-scale local details and structural relationships. The proposed method is evaluated across several challenging scenarios, including class imbalance learning, multi-classification, cross-dataset testing, and adversarial sample attacks. The experimental results demonstrate a significant improvement compared to the best previous methods. For instance, it achieved a peak F1-score improvement of 40% in class-imbalanced scenarios, and exceeded the best baseline result by 14.13% in accuracy in adversarial attack scenarios. Additionally, we conduct a case study where our method accurately identifies all 30 active malicious web pages, whereas two pior SOTA methods miss 4 and 7 malicious web pages respectively. The codes and data are available at: <span><span>https://github.com/Vul-det/TransURL/</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50637,"journal":{"name":"Computer Networks","volume":null,"pages":null},"PeriodicalIF":4.4000,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1389128624005395","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0
Abstract
While machine learning progress is advancing the detection of malicious URLs, advanced Transformers applied to URLs face difficulties in extracting local information, character-level information, and structural relationships. To address these challenges, we propose a novel approach for malicious URL detection, named TransURL, that is implemented by co-training the character-aware Transformer with three feature modules—Multi-Layer Encoding, Multi-Scale Feature Learning, and Spatial Pyramid Attention. This special Transformer allows TransURL to extract embeddings that contain character-level information from URL token sequences, with three feature modules contributing to the fusion of multi-layer Transformer encodings and the capture of multi-scale local details and structural relationships. The proposed method is evaluated across several challenging scenarios, including class imbalance learning, multi-classification, cross-dataset testing, and adversarial sample attacks. The experimental results demonstrate a significant improvement compared to the best previous methods. For instance, it achieved a peak F1-score improvement of 40% in class-imbalanced scenarios, and exceeded the best baseline result by 14.13% in accuracy in adversarial attack scenarios. Additionally, we conduct a case study where our method accurately identifies all 30 active malicious web pages, whereas two pior SOTA methods miss 4 and 7 malicious web pages respectively. The codes and data are available at: https://github.com/Vul-det/TransURL/.
期刊介绍:
Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers and operators of networks as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.