URL-based Phishing Detection using the Entropy of Non-Alphanumeric Characters

Eint Sandi Aung, H. Yamana
Published in: Proceedings of the 21st International Conference on Information Integration and Web-based Applications & Services, 2019-12-02
DOI: https://doi.org/10.1145/3366030.3366064
Citations: 5

Abstract

Phishing is a form of personal information theft in which phishers lure users into revealing sensitive information. Phishing detection mechanisms based on various techniques have been developed. Our hypothesis is that phishers create fake websites with as little information as possible in each webpage, which makes detection difficult for content-based and visual similarity-based methods that analyze webpage content. To overcome this, we focus on using Uniform Resource Locators (URLs) to detect phishing. Since previous work extracts specific special-character features, we assume that the distribution of non-alphanumeric (NAN) characters strongly affects the performance of URL-based detection. We hence propose a new feature, the entropy of NAN characters, for URL-based phishing detection. Experimental evaluation shows 96% ROC AUC on a balanced dataset and 89% ROC AUC on an imbalanced dataset, an increase of 5 to 6% in ROC AUC compared with not using the proposed feature.
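
The abstract does not state the exact formula for the proposed feature. As a minimal sketch, assuming the feature is the Shannon entropy (in bits) of the empirical distribution of NAN characters within a single URL, and assuming "alphanumeric" means ASCII letters and digits, it could be computed as shown here. The function name `nan_entropy` is hypothetical and is not taken from the paper.

```python
import math
from collections import Counter


def nan_entropy(url: str) -> float:
    """Shannon entropy, in bits, of the non-alphanumeric characters in a URL.

    Assumption: a character counts as alphanumeric only if it is an ASCII
    letter or digit; everything else (e.g. '/', '.', '-', '%') is NAN.
    """
    # Collect every character that is not an ASCII letter or digit.
    nan_chars = [ch for ch in url if not (ch.isascii() and ch.isalnum())]
    if not nan_chars:
        # No NAN characters: define the entropy as zero.
        return 0.0

    total = len(nan_chars)
    counts = Counter(nan_chars)
    # H = sum over distinct characters c of p(c) * log2(1 / p(c)).
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    # Illustrative URLs only; they are not drawn from the paper's datasets.
    for example in ("https://example.com/login",
                    "http://example.com/a-b_c/?id=1&x=%20@y"):
        print(example, round(nan_entropy(example), 3))
```

Under these assumptions, a URL whose NAN characters are all the same symbol has entropy 0, and one whose NAN characters are split evenly between two symbols has entropy of 1 bit. The resulting value would be one input among the other URL features fed to a classifier.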