"Kn0w Thy Doma1n Name": Unbiased Phishing Detection Using Domain Name Based Features

Proceedings of the 23rd ACM on Symposium on Access Control Models and Technologies. Pub Date: 2018-06-07. DOI: 10.1145/3205977.3205992

H. Shirazi, Bruhadeshwar Bezawada, I. Ray


Citations: 73

Abstract

Phishing websites remain a persistent security threat. Thus far, machine learning approaches appear to have the best potential as defenses. However, existing machine learning approaches for phishing detection raise two main concerns. The first is the large number of training features used and the lack of validating arguments for these feature choices. The second is that the datasets used in the literature are inadvertently biased with respect to features based on the website URL or content. To address these concerns, we put forward the intuition that the domain name of a phishing website is the tell-tale sign of phishing and holds the key to successful phishing detection. Accordingly, we design features that model the relationships, visual as well as statistical, of the domain name to the key elements of a phishing website, which are used to snare end users. The main value of our feature design is that, to bypass detection, an attacker will find it very difficult to tamper with the visual content of the phishing website without arousing the suspicion of the end user. Our feature set ensures that there is minimal or no bias with respect to a dataset. Our learning model trains with only seven features and achieves a true positive rate of 98% and a classification accuracy of 97% on a sample dataset. Compared to the state-of-the-art work, our per-instance classification is 4 times faster for legitimate websites and 10 times faster for phishing websites. Importantly, we demonstrate the shortcomings of using features based on URLs, as they are likely to be biased towards specific datasets. We show the robustness of our learning algorithm by testing on unknown live phishing URLs and achieve a high detection accuracy of 99.7%.
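To make the idea of relating a domain name to page elements concrete, here is a minimal Python sketch. It is not the authors' feature set: the abstract does not list the seven features, so the three computed below (`domain_in_title`, `same_domain_link_ratio`, `non_alpha_in_domain`) are hypothetical illustrations, and `domain_label` is a naive helper that ignores public-suffix rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _PageScanner(HTMLParser):
    """Collects the <title> text and every <a href> value on a page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.hrefs = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def domain_label(url):
    """Naive guess at the registrable label: the second-to-last host label.

    Assumption: no public-suffix handling, so this is wrong for hosts
    such as example.co.uk.
    """
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else host


def domain_features(url, html):
    """Return three illustrative (hypothetical) domain-name features."""
    scanner = _PageScanner()
    scanner.feed(html)
    label = domain_label(url)
    title = "".join(scanner.title_parts).lower()
    # Resolve relative links against the page URL before comparing labels.
    link_labels = [domain_label(urljoin(url, h)) for h in scanner.hrefs]
    same = sum(1 for item in link_labels if item == label)
    return {
        # Does the page title mention the domain it is served from?
        "domain_in_title": bool(label) and label in title,
        # Share of links that stay on the page's own domain.
        "same_domain_link_ratio": same / len(link_labels) if link_labels else 0.0,
        # Digits and hyphens in the domain label.
        "non_alpha_in_domain": sum(1 for ch in label if not ch.isalpha()),
    }


PAGE = (
    "<html><head><title>PayPal - Log In</title></head><body>"
    '<a href="https://www.paypal.com/help">Help</a>'
    '<a href="/login">Log in</a>'
    "</body></html>"
)

suspicious = domain_features("http://paypa1-secure.com/login", PAGE)
genuine = domain_features("https://www.paypal.com/signin", PAGE)
```

With the same page content served from two different hosts, only the look-alike host produces a title that does not mention its own domain and links that point elsewhere. A real system would feed such values to a trained classifier rather than inspect them by hand.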
