Beyond the west: Revealing and bridging the gap between Western and Chinese phishing website detection

IF 4.8 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Computers & Security · Pub Date: 2024-09-26 · DOI: 10.1016/j.cose.2024.104115

Ying Yuan, Giovanni Apruzzese, Mauro Conti

Volume 148, Article 104115 · https://www.sciencedirect.com/science/article/pii/S0167404824004206

Citations: 0

Abstract

Phishing attacks are on the rise, and phishing websites are everywhere, which exposes the brittleness of security mechanisms that rely on blocklists. To cope with this threat, many works have proposed enhancing Phishing Website Detectors (PWD) with data-driven techniques powered by Machine Learning (ML). Despite achieving promising results in both research and practice, existing solutions mostly focus “on the West”, e.g., they consider websites in English, German, or Italian. In contrast, phishing websites targeting “Eastern” countries, such as China, have been mostly neglected, even though phishing is also rampant in this part of the world.

In this paper, we scrutinize whether current PWD can work simultaneously against Western and Chinese phishing websites. First, after highlighting the difficulties of practically testing PWD on Chinese phishing websites, we create ChiPhish, a dataset that enables the assessment of PWD on Chinese websites. Then, on both Western and Chinese websites, we evaluate 72 PWD developed by industry practitioners and 10 ML-based PWD proposed in recent research: our results highlight that existing solutions, despite achieving low false positive rates, exhibit unacceptably low detection rates (sometimes below 1%) on phishing websites from different regions. Next, to bridge the gap we brought to light, we elucidate the differences between Western and Chinese websites, and devise an enhanced feature set that accounts for the unique characteristics of Chinese websites. We empirically demonstrate the effectiveness of our proposed feature set by replicating (and testing) state-of-the-art ML-PWD: our results show a small but statistically significant improvement over the baselines. Finally, we review all our previous contributions and combine them to develop practical PWD that work simultaneously on Chinese and Western websites, achieving a detection rate above 0.98 while maintaining a false positive rate of only 0.01 in a cross-regional setting. We openly release all our tools, disclose all our benchmark results, and perform proof-of-concept experiments revealing that the problem tackled by our paper extends to other “Eastern” countries that have been overlooked by prior research on PWD.
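The two metrics quoted in the abstract, detection rate and false positive rate, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `rates_by_region`, the `(region, is_phishing, flagged)` tuple layout, and the toy labels are all assumptions made up for illustration. It only shows how the two rates might be computed separately per region from a detector's verdicts.

```python
from collections import defaultdict


def rates_by_region(samples):
    """Compute detection rate (TPR) and false positive rate (FPR) per region.

    Each sample is a (region, is_phishing, flagged) tuple. This layout is a
    hypothetical one chosen for the illustration, not taken from the paper.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for region, is_phishing, flagged in samples:
        if is_phishing:
            key = "tp" if flagged else "fn"   # phishing page caught or missed
        else:
            key = "fp" if flagged else "tn"   # benign page flagged or passed
        counts[region][key] += 1

    rates = {}
    for region, c in counts.items():
        phishing = c["tp"] + c["fn"]
        benign = c["fp"] + c["tn"]
        rates[region] = {
            # Fraction of phishing pages that the detector flagged.
            "detection_rate": c["tp"] / phishing if phishing else None,
            # Fraction of benign pages that the detector wrongly flagged.
            "false_positive_rate": c["fp"] / benign if benign else None,
        }
    return rates


# Toy, made-up verdicts: (region, is_phishing, flagged_by_detector).
samples = [
    ("western", True, True),
    ("western", True, True),
    ("western", False, False),
    ("western", False, False),
    ("chinese", True, False),
    ("chinese", True, True),
    ("chinese", False, False),
    ("chinese", False, True),
]
result = rates_by_region(samples)
for region, r in result.items():
    print(region, r)
```

Reporting the rates per region rather than pooled is the point of the sketch: a detector can look strong overall while missing most phishing pages from one region.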

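Source journal

Computers & Security (Engineering & Technology: Computer Science, Information Systems)

CiteScore: 12.40

Self-citation rate: 7.10%

Articles published: 365

Review time: 10.7 months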

About the journal: Computers & Security is the most respected technical journal in the IT security field. With its high-profile editorial board and informative regular features and columns, the journal is essential reading for IT security professionals around the world. Computers & Security provides you with a unique blend of leading-edge research and sound practical management advice. It is aimed at the professional involved with computer security, audit, control and data integrity in all sectors: industry, commerce and academia. Recognized worldwide as THE primary source of reference for applied research and technical expertise, it is your first step to fully secure systems.