Lixia Xie, Hao Zhang, Hongyu Yang, Ze Hu, Xiang Cheng
Title: A scalable phishing website detection model based on dual-branch TCN and mask attention
Journal: Computer Networks, vol. 263, Article 111230
Publication date: 2025-03-15
DOI: 10.1016/j.comnet.2025.111230
URL: https://www.sciencedirect.com/science/article/pii/S1389128625001987
Journal impact factor: 4.4 (JCR Q1, Computer Science, Hardware & Architecture)
Citation count: 0
Abstract
Phishing website detection models face challenges such as missing features, limited feature extraction capabilities, and significant computational resource consumption when processing multidimensional features. Additionally, publicly available datasets often lack diversity and scalability and are vulnerable to disguise attacks, resulting in poor model generalizability.

This paper addresses these issues by proposing a multiclass scalable dataset, Crawling2024, collected using a WebDriver-based collector that simulates human operations to avoid attacker disguises. Through data analysis, we identify handcrafted features from access information and URLs. These features help reduce the computational load of deep learning models and expand feature dimensions. Crawling2024 retains data identifiers (IDs), enabling further extension through data scraping.

We also introduce a scalable phishing website detection model (SPWDM) that uses a dual-branch temporal convolutional network (TCN) to extract local correlations and long-term dependencies of domain names. The model incorporates a lightweight spatial-channel (SC) attention mechanism to enhance interactions between channels and space. Additionally, it uses a mask attention mechanism to manage extended features and adjust focus when features are missing. Our feature fusion method combines the enhanced features extracted by the dual-branch TCN with the various features processed by the mask attention mechanism.

The experimental results show that the proposed detection method achieves an accuracy of 97.66% on the Crawling2024 dataset. This is 0.52% to 2.72% higher than other methods, and the method maintains a leading position on other public datasets.
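
The abstract does not say how the mask attention mechanism is implemented. As an illustration only, the sketch below shows one common way to "adjust focus when features are missing": a masked softmax that gives missing features zero weight and renormalizes over the features that were collected. The function name `masked_attention`, its arguments, and the use of a plain softmax are assumptions made for this example, not the authors' method.

```python
import math


def masked_attention(scores, values, present):
    """Softmax-weighted sum of feature values that skips missing features.

    scores:  one relevance score per feature (hypothetical, e.g. learned logits)
    values:  one numeric value per feature (ignored where the feature is missing)
    present: True where the feature was collected, False where it is missing
    """
    if not (len(scores) == len(values) == len(present)):
        raise ValueError("scores, values and present must have equal length")

    kept = [s for s, p in zip(scores, present) if p]
    if not kept:
        # Every feature is missing: zero weights and a neutral output.
        return [0.0] * len(scores), 0.0

    peak = max(kept)  # subtract the max score for numerical stability
    exps = [math.exp(s - peak) if p else 0.0 for s, p in zip(scores, present)]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Only present features contribute, so placeholder values never leak in.
    output = sum(w * v for w, v, p in zip(weights, values, present) if p)
    return weights, output


if __name__ == "__main__":
    # Third feature is missing; the two present features share the weight.
    weights, output = masked_attention(
        scores=[0.0, 0.0, 0.0],
        values=[2.0, 4.0, 100.0],
        present=[True, True, False],
    )
    print(weights, output)
```

In this sketch the weights of the present features always sum to 1, so removing a feature shifts attention to the remaining ones instead of shrinking the output.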
About the journal:
Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers and operators of networks as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.