Comparison of String Matching Algorithms on Spam Email Detection

2018 International Congress on Big Data, Deep Learning and Fighting Cyber Terrorism (IBIGDELFT). Pub Date: 2018-12-01. DOI: 10.1109/IBIGDELFT.2018.8625317

C. Varol, H. Abdulhadi


Citations: 8

Abstract



Email is one of the most expedient ways to transfer messages among people all over the world. Its features, specifically reliability, speed, and low cost, make it popular and useful in most parts of business and society. On the other hand, this popularity has also given rise to new harmful actions in cyberspace, such as email attacks (spam). Spam is arguably one of the main causes of the WWW being flooded with many copies of similar messages generated by anonymous senders, which wastes the time and storage space of email account holders and also poses a large virus and malware threat to email providers. Although various filters, such as machine learning and content-based filtering, are employed to handle the spam problem, spammers are still able to bypass these defense mechanisms. In this paper, we investigate the use of string matching algorithms for spam email detection. In particular, this work examines and compares the efficiency of six well-known string matching algorithms, namely Longest Common Subsequence (LCS), Levenshtein Distance (LD), Jaro, Jaro-Winkler, Bi-gram, and TFIDF, on two datasets: the Enron corpus and the CSDMC2010 spam dataset. We observed that the Bi-gram algorithm performs best in spam detection on both datasets.
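To make three of the measures named in the abstract concrete, here is a minimal Python sketch of Bi-gram similarity, Levenshtein Distance, and Longest Common Subsequence length. It is not the authors' implementation. The abstract does not say how the Bi-gram score is defined, so the Dice coefficient over character bi-grams is used here as one common choice. The function names and the helper `best_match_score`, which compares a message against a list of known spam texts, are hypothetical and exist only for illustration.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count the overlapping character bi-grams of a lowercased string."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over bi-gram multisets, in [0, 1] (assumed definition)."""
    ga, gb = bigrams(a), bigrams(b)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        # Both strings are shorter than two characters: fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    shared = sum((ga & gb).values())  # multiset intersection
    return 2.0 * shared / total


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def best_match_score(message: str, known_spam: list[str]) -> float:
    """Hypothetical helper: highest bi-gram similarity to any known spam text."""
    return max((bigram_similarity(message, s) for s in known_spam), default=0.0)
```

A message could then be flagged when `best_match_score` exceeds a cutoff. Any such threshold is an assumption that would have to be tuned on labeled data such as the corpora named in the abstract.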


Source journal

2018 International Congress on Big Data, Deep Learning and Fighting Cyber Terrorism (IBIGDELFT)

Self-citation rate

0.00%

Articles published