Phishing Web Page Detection Using N-gram Features Extracted From URLs

Mehmet Korkmaz, Emre Kocyigit, O. K. Sahingoz, B. Diri
Published in: 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA)
Publication date: 2021-06-11
DOI: 10.1109/HORA52670.2021.9461378 (https://doi.org/10.1109/HORA52670.2021.9461378)
Citations: 11

Abstract

Cyber-attacks have increased worldwide in recent years, especially during the pandemic. The growing number of connected devices and the anonymous structure of the internet expose not only computer networks but also individual computing devices to this security gap. As computing devices are used anytime and anywhere, many real-world activities have moved to the digital world and been adapted to new lifestyles. As a result, cybersecurity has drawn more attention not only from security administrators but also from academics and researchers. Phishing attacks, which attackers have favored over the last decade, have become even more harmful because they target the weakest link in the security chain: the computer user. It is therefore extremely important to stop these attacks before they reach users. With this in mind, we aimed to implement a phishing detection system that uses a Convolutional Neural Network with n-gram features extracted from URLs. Several n-gram feature extraction techniques exist, and this work aims to determine which of them is most effective for the proposed system. A second goal is to find which n-gram parameters work best. In the experiments, unigrams gave the highest accuracy. Rather than using every character observed in the unigram set, a specified set of 70 characters (regardless of case) gave the highest accuracy, 88.90%, on a High-Risk URL dataset. The experiments also showed that a URL can be classified as legitimate or phishing in about 0.008 seconds. These results are very good in terms of both accuracy and run-time efficiency.
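
To make the feature extraction step concrete, here is a minimal Python sketch of character n-gram extraction from a URL and of a unigram index encoding that could be fed to a CNN embedding layer. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not list the 70 characters, so `ALPHABET` below is a hypothetical 68-symbol stand-in (lowercase letters, digits, ASCII punctuation). The function names, the padding length `max_len`, and the choice to lowercase the URL are likewise assumptions.

```python
import string
from typing import List

# Hypothetical stand-in alphabet (68 symbols); the paper's exact 70-character
# set is not given in the abstract.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation

# Index 0 is reserved for padding and for characters outside the alphabet.
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def char_ngrams(url: str, n: int = 1) -> List[str]:
    """Return all overlapping character n-grams of the lowercased URL."""
    text = url.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def encode_unigrams(url: str, max_len: int = 200) -> List[int]:
    """Map each character to its alphabet index, truncating or zero-padding to max_len."""
    ids = [CHAR_TO_INDEX.get(ch, 0) for ch in url.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    sample = "http://example.com/login"  # illustrative URL, not from the dataset
    print(char_ngrams(sample, 2)[:5])
    print(encode_unigrams(sample, max_len=30))
```

With `n = 1` the n-grams are single characters, which corresponds to the unigram setting the abstract reports as most accurate. Larger `n` grows the vocabulary quickly, which is one plausible reason to restrict unigrams to a fixed character set.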