An Online Malicious Spam Email Detection System Using Resource Allocating Network with Locality Sensitive Hashing

Siti-Hajar-Aminah Ali, S. Ozawa, J. Nakazato, Tao Ban, Jumpei Shimamura
Journal of Intelligent Learning Systems and Applications (智能学习系统与应用(英文)), published 2015-04-15
DOI: 10.4236/JILSA.2015.72005 (https://doi.org/10.4236/JILSA.2015.72005)
Citations: 10

Abstract

In this paper, we propose a new online system that quickly detects malicious spam emails and, through daily updates, adapts to changes in email contents and in the Uniform Resource Locator (URL) links that lead to malicious websites. We introduce an autonomous function for a server to generate training examples: double-bounce emails are collected automatically, and their class labels are assigned by SPIKE, a crawler-type software that analyzes website maliciousness. Because spammers generally use botnets to spread large numbers of malicious emails within a short time, such distributed spam emails often have the same or similar contents, so it is not necessary to learn all of them. To adapt quickly to new malicious campaigns, only new types of spam emails should be selected for learning, which can be realized by introducing an active learning scheme into a classifier model. For this purpose, we adopt the Resource Allocating Network with Locality Sensitive Hashing (RAN-LSH) as a classifier model with a data selection function. In RAN-LSH, spam emails that are the same as or similar to those already learned are found quickly through a hash table built with Locality Sensitive Hashing (LSH), and matched similar emails that are already “well-learned” are discarded without being used as training data. To analyze email contents, we adopt the Bag of Words (BoW) approach and generate feature vectors whose attributes are transformed by the normalized term frequency-inverse document frequency (TF-IDF). To evaluate the performance of the proposed system, we use a data set of double-bounce spam emails collected at the National Institute of Information and Communications Technology (NICT) in Japan from March 1st, 2013 until May 10th, 2013. The results confirm that the proposed spam email detection system achieves a high detection rate.
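The data selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' RAN-LSH implementation: the abstract does not specify the LSH family or how an entry becomes "well-learned", so the sketch assumes random-hyperplane (sign) hashing, a smoothed IDF formula, and a hypothetical rule that marks a hash bucket as well-learned as soon as one email from it has been selected. The names `tfidf_vectors` and `LshSelector` are illustrative only.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return L2-normalized TF-IDF vectors (term -> weight) for tokenized docs."""
    n = len(docs)
    # Document frequency: number of docs in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption; the abstract gives no exact formula).
        vec = {t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


class LshSelector:
    """Random-hyperplane LSH table used to skip emails similar to learned ones."""

    def __init__(self, vocab, n_bits=8, seed=0):
        rng = random.Random(seed)
        # One random Gaussian hyperplane per hash bit.
        self.planes = [{t: rng.gauss(0.0, 1.0) for t in vocab}
                       for _ in range(n_bits)]
        # Hypothetical stand-in for the "well-learned" flag in the abstract.
        self.well_learned = set()

    def key(self, vec):
        """Hash a sparse vector to a tuple of sign bits."""
        return tuple(
            sum(w * plane.get(t, 0.0) for t, w in vec.items()) >= 0.0
            for plane in self.planes
        )

    def select(self, vec):
        """Return True if the email should be used as training data."""
        k = self.key(vec)
        if k in self.well_learned:
            return False  # similar email already learned: discard
        self.well_learned.add(k)
        return True


docs = [
    ["cheap", "pills", "click", "link"],
    ["cheap", "pills", "click", "link"],   # duplicate from the same campaign
    ["invoice", "attached", "open", "file"],
]
vecs = tfidf_vectors(docs)
vocab = sorted({t for d in docs for t in d})
selector = LshSelector(vocab)
selected = [selector.select(v) for v in vecs]
```

In this sketch the second email hashes to the same bucket as the first and is skipped, which mirrors the abstract's point that repeated botnet spam need not be learned again. A real system would tie the "well-learned" flag to the classifier's behaviour on that bucket rather than to a single visit.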