Phishing Web Page Detection Using N-gram Features Extracted From URLs
Mehmet Korkmaz, Emre Kocyigit, O. K. Sahingoz, B. Diri
2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), 2021-06-11
DOI: 10.1109/HORA52670.2021.9461378
Citations: 11
Abstract
Cyber-attacks have recently increased worldwide, especially during the pandemic. The number of connected devices and the anonymous structure of the internet create this security gap not only for computer networks but also for individual computing devices. As computing devices are used anytime and anywhere, many real-world activities have moved to the digital world and been adapted to new lifestyles. Cybersecurity has therefore drawn more attention from both security administrators and academics/researchers. Phishing attacks, which hackers have favored over the last decade, have become even more harmful because they target the weakest part of the security chain: the computer user. It is therefore extremely important to stop these attacks before they reach users. Based on this idea, we aimed to implement a phishing detection system using a Convolutional Neural Network with n-gram features extracted from URLs. There are different n-gram feature extraction techniques, and this work aims to determine which of them is most effective for our proposal. A second goal is to discover which n-gram parameters work best. Experiments showed that unigrams yield the highest accuracy. Moreover, instead of all the characters obtained with unigrams, a specified set of 70 characters (ignoring case) gives the highest accuracy, 88.90%, on a High-Risk URL dataset. Experimental results also showed that a URL can be classified (as either legitimate or phishing) in about 0.008 seconds. These results can be regarded as very good in terms of both accuracy and run-time efficiency.
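To make the character-level n-gram idea concrete, the following is a minimal sketch of how a URL might be split into n-grams and how unigrams might be turned into a fixed-length integer sequence for a neural network. It is not the authors' implementation. The abstract does not list the 70 characters used, so `ALPHABET` below is a hypothetical 68-symbol set (lowercase letters, digits, ASCII punctuation), and the names `char_ngrams`, `encode_unigrams`, and the `max_len` default are likewise assumptions made for illustration.

```python
import string

# Hypothetical alphabet (68 symbols). The paper's 70-character set is not
# given in the abstract, so this is only a stand-in.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation

# Index 0 is reserved for padding and for characters outside the alphabet.
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def char_ngrams(url, n):
    """Return all overlapping character n-grams of a URL, ignoring case."""
    text = url.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def encode_unigrams(url, max_len=200):
    """Map each character to an integer index, then truncate or zero-pad.

    max_len is an assumed value, not one reported in the abstract.
    """
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in url.lower()[:max_len]]
    return indices + [0] * (max_len - len(indices))


if __name__ == "__main__":
    url = "http://Example.com/login"
    print(char_ngrams(url, 2)[:5])        # first few bigrams
    print(encode_unigrams(url, 30))       # padded unigram index sequence
```

A sequence produced this way could be fed to an embedding layer followed by convolutional layers, but the network architecture itself is not described in the abstract and is left out here.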