A novel approach for malicious URL detection using RoBERTa and sparse autoencoder
Authors: Zhiqing Huang, Tian Ban, Yanxin Zhang
Journal of Information Security and Applications, volume 94, Article 104214, published 2025-09-03
DOI: 10.1016/j.jisa.2025.104214
https://www.sciencedirect.com/science/article/pii/S2214212625002510
Detecting malicious URLs within requests is an effective way to block Web threats. Current detection methods rely mainly on supervised machine learning algorithms to build classification models, and therefore demand high-quality training data. They are also limited in detecting malicious samples, which results in a high false negative rate on unknown anomalies. This paper proposes an anomaly detection method for malicious URLs based on RoBERTa and a sparse autoencoder. The URL samples are first preprocessed. RoBERTa then extracts features from the URLs and converts them into feature vectors, and the sparse autoencoder finally detects the malicious samples. Only benign samples are used as input during training, so the sparse autoencoder learns to reconstruct the characteristics of benign samples effectively, which allows malicious ones to be identified. The method was tested on a dataset composed of CSIC2010 and PRDREQ. The experimental results show that the detection model achieves an accuracy of 0.9921, a recall of 0.9863, and an F1 score of 0.9887, outperforming all baseline methods.
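The detection stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: synthetic 16-dimensional vectors stand in for RoBERTa feature vectors, the autoencoder has a single ReLU hidden layer with an L1 penalty on its activations, and a sample is flagged as malicious when its reconstruction error exceeds the 99th percentile of the errors seen on benign training data. The names train_sparse_autoencoder, reconstruction_error, and make_benign are hypothetical.

```python
import numpy as np

def train_sparse_autoencoder(X, hidden=8, l1=1e-3, lr=0.1, epochs=3000, seed=0):
    """Fit a one-hidden-layer ReLU autoencoder with an L1 penalty on activations."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        Z = X @ W1 + b1
        H = np.maximum(Z, 0.0)                     # hidden code (ReLU)
        err = H @ W2 + b2 - X                      # reconstruction residual
        g_out = 2.0 * err / n                      # grad of batch-averaged squared error
        g_hid = (g_out @ W2.T + l1 / n) * (Z > 0)  # add L1 term, backprop through ReLU
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum(axis=0)
        W1 -= lr * (X.T @ g_hid)
        b1 -= lr * g_hid.sum(axis=0)
    return W1, b1, W2, b2

def reconstruction_error(X, params):
    """Per-sample squared reconstruction error."""
    W1, b1, W2, b2 = params
    H = np.maximum(X @ W1 + b1, 0.0)
    return np.sum((H @ W2 + b2 - X) ** 2, axis=1)

# Assumption: synthetic stand-ins for RoBERTa feature vectors.
# Benign vectors lie close to a 3-D subspace; "malicious" ones do not.
rng = np.random.default_rng(42)
Q, _ = np.linalg.qr(rng.normal(size=(16, 3)))      # orthonormal basis, shape (16, 3)

def make_benign(n):
    return 0.5 * rng.normal(size=(n, 3)) @ Q.T + 0.02 * rng.normal(size=(n, 16))

benign_train = make_benign(500)
benign_test = make_benign(200)
malicious_test = rng.normal(size=(200, 16))

# Train on benign samples only, then set the threshold from benign errors.
params = train_sparse_autoencoder(benign_train)
threshold = np.percentile(reconstruction_error(benign_train, params), 99)

benign_flags = reconstruction_error(benign_test, params) > threshold
malicious_flags = reconstruction_error(malicious_test, params) > threshold
print(f"threshold={threshold:.4f}")
print(f"false positive rate={benign_flags.mean():.3f}, recall={malicious_flags.mean():.3f}")
```

Because the threshold is derived from benign data alone, no malicious samples are needed at training time, which matches the benign-only training regime the abstract describes. The percentile is a tunable assumption: lowering it trades more false positives for higher recall.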
About the journal:
Journal of Information Security and Applications (JISA) focuses on original research and practice-driven applications relevant to information security and its applications. JISA links a vibrant scientific and research community with industry professionals by offering a clear view of modern problems and challenges in information security and by identifying promising scientific and "best-practice" solutions. JISA issues balance original research work with innovative industrial approaches by internationally renowned information security experts and researchers.