A scalable phishing website detection model based on dual-branch TCN and mask attention

IF 4.4 | Zone 2 (Computer Science) | JCR Q1: COMPUTER SCIENCE, HARDWARE & ARCHITECTURE
Lixia Xie, Hao Zhang, Hongyu Yang, Ze Hu, Xiang Cheng
Journal: Computer Networks, Volume 263, Article 111230
Publication date: 2025-03-15
DOI: 10.1016/j.comnet.2025.111230
URL: https://www.sciencedirect.com/science/article/pii/S1389128625001987
Citations: 0

Abstract

Phishing website detection models face challenges such as missing features, limited feature extraction capabilities, and significant computational resource consumption when processing multidimensional features. Additionally, publicly available datasets often lack diversity and scalability and are vulnerable to disguise attacks, resulting in poor model generalizability. This paper addresses these issues by proposing a multiclass scalable dataset, Crawling2024, collected using a WebDriver-based collector that simulates human operations to avoid attacker disguises. Through data analysis, we identify handcrafted features from access information and URLs. These features help reduce the computational load of deep learning models and expand feature dimensions. Crawling2024 retains data identifiers (IDs), enabling further extension through data scraping. We also introduce a scalable phishing website detection model (SPWDM) that utilizes a dual-branch temporal convolutional network (TCN) to extract local correlations and long-term dependencies of domain names. The model incorporates a lightweight spatial-channel (SC) attention mechanism to enhance interactions between channels and space. Additionally, it uses a mask attention mechanism to manage extended features and adjust focus when features are missing. Our feature fusion method combines the enhanced features extracted by the dual-branch TCN with the various features processed by the mask attention mechanism. The experimental results demonstrate that our proposed detection method achieves excellent performance, with an accuracy of 97.66% on the Crawling2024 dataset. This is 0.52% to 2.72% higher than other methods, and it maintains a leading position on other public datasets.
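
The abstract does not specify how the mask attention mechanism is computed. As a rough illustration of the general idea only, and not the authors' SPWDM implementation, the sketch below shows a masked softmax that gives missing features zero weight, so the attention is redistributed over the features that are present. The function name `masked_attention_pool` and its parameters are hypothetical, and the attention scores are assumed to come from some upstream scoring layer.

```python
import math


def masked_attention_pool(features, scores, present):
    """Softmax-weighted sum of feature vectors that ignores missing ones.

    features: non-empty list of equal-length vectors, one per feature group
    scores:   raw attention scores, one per feature (assumed given)
    present:  booleans; False marks a feature as missing
    Returns (pooled_vector, attention_weights).
    """
    dim = len(features[0])
    if not any(present):
        # Nothing to attend to: fall back to a zero vector and zero weights.
        return [0.0] * dim, [0.0] * len(features)

    # Subtract the largest present score for numerical stability.
    top = max(s for s, p in zip(scores, present) if p)
    # Missing features contribute exactly zero to the softmax.
    exps = [math.exp(s - top) if p else 0.0 for s, p in zip(scores, present)]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum over features, computed one dimension at a time.
    pooled = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return pooled, weights


# Hypothetical example: the third feature is missing, so its high score is ignored.
pooled, weights = masked_attention_pool(
    features=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
    scores=[0.0, 0.0, 3.0],
    present=[True, True, False],
)
print(weights)  # [0.5, 0.5, 0.0]
print(pooled)   # [0.5, 0.5]
```

Masking inside the softmax, rather than zeroing the feature vectors afterwards, keeps the weights of the present features summing to one, which is one plausible reading of "adjust focus when features are missing".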
Source journal
Computer Networks (Engineering & Technology: Telecommunications)
CiteScore: 10.80
Self-citation rate: 3.60%
Articles published: 434
Review time: 8.6 months
About the journal: Computer Networks is an international, archival journal providing a publication vehicle for complete coverage of all topics of interest to those involved in the computer communications networking area. The audience includes researchers, managers and operators of networks as well as designers and implementors. The Editorial Board will consider any material for publication that is of interest to those groups.