Classification of Landing and Distribution Domains Using Whois’ Text Mining

2017 IEEE Trustcom/BigDataSE/ICESS. Pub Date: 2017-08-01. DOI: 10.1109/Trustcom/BigDataSE/ICESS.2017.213

Tran Thao Phuong, A. Yamada, Kosuke Murakami, J. Urakawa, Y. Sawaya, A. Kubota


Citations: 6

Abstract

Detection of drive-by download attacks has become a focus of security research, since this attack has turned into the most popular and serious threat to web infrastructure. The attack exploits vulnerabilities in web browsers and their extensions to download malicious software without the user noticing. Often, the victim is sent through a long chain of redirections, which makes the offending pages harder to take down. Concretely, the attack is triggered when a user visits a benign webpage that the attacker has compromised by inserting malicious code (the landing page). The user is then automatically redirected, without consent or knowledge, to the page that actually installs malware on the user's computer (the distribution page). While a large body of work targets the detection of drive-by download attacks, little attention has been paid to redirection, which is a crucial characteristic of the attack. In this paper, for the first time, we propose an approach to classifying landing and distribution domains, the important components that form the head and tail of a redirection chain in the attack. Our approach applies machine learning to text mining of the domains' registration information, known as whois. We implemented our approach with six popular supervised learning algorithms, compared the results, and concluded that the linear Support Vector Machine and the CART-based Decision Tree are the best models for our dataset, achieving respectively 98.55% and 99.28% accuracy, 97.78% and 98.95% F1 score, and 98.35% and 99.45% average precision.
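To make the text-mining and evaluation steps concrete, here is a minimal sketch using only the Python standard library. It is not the authors' implementation: the abstract does not say how whois text is turned into features, so the bag-of-words tokenization, the function names `tokenize`, `bag_of_words`, and `binary_metrics`, and the sample record are all assumptions made for illustration. The sketch covers accuracy and F1 only; average precision, which the abstract also reports, needs ranked classifier scores and is left out.

```python
import re
from collections import Counter

# Hypothetical whois snippet written for this example; NOT data from the paper.
SAMPLE_WHOIS = """Domain Name: EXAMPLE.COM
Registrar: Example Registrar, Inc.
Registrant Country: JP
Name Server: NS1.EXAMPLE.COM"""


def tokenize(whois_text):
    """Lowercase a whois record and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", whois_text.lower())


def bag_of_words(whois_text):
    """Map a whois record to term counts (an assumed feature representation)."""
    return Counter(tokenize(whois_text))


def binary_metrics(y_true, y_pred, positive="distribution"):
    """Compute accuracy, precision, recall, and F1 for a two-class labelling.

    `positive` names the class treated as the positive one; choosing
    "distribution" here is an arbitrary assumption of this sketch.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Term counts that a supervised classifier could consume as features.
    print(bag_of_words(SAMPLE_WHOIS).most_common(3))

    # Made-up labels and predictions, used only to exercise the metrics.
    truth = ["landing", "landing", "distribution", "distribution"]
    preds = ["landing", "distribution", "distribution", "distribution"]
    print(binary_metrics(truth, preds))
```

In a full pipeline, the term counts would be fed to a supervised learner such as a linear SVM or a CART decision tree, and the predictions on held-out domains would be scored with `binary_metrics`.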

