Comparison of String Matching Algorithms on Spam Email Detection
C. Varol, H. Abdulhadi
2018 International Congress on Big Data, Deep Learning and Fighting Cyber Terrorism (IBIGDELFT), 2018-12-01
DOI: https://doi.org/10.1109/IBIGDELFT.2018.8625317
Email is one of the most expedient ways to exchange messages among people all over the world. Its features, specifically reliability, speed, and low cost, make it popular and useful across most areas of business and society. However, this popularity has also given rise to new harmful activity in cyberspace, such as email attacks (spam). Spam is arguably one of the main causes of the WWW being flooded with many copies of similar messages sent by anonymous senders, which wastes the time and storage space of email account holders and poses a large virus and malware threat to email providers. Although various filters, such as machine learning and content-based filtering, are employed to handle the spam problem, spammers are still able to bypass these defense mechanisms. In this paper, we investigate the use of string matching algorithms for spam email detection. In particular, this work examines and compares the efficiency of six well-known string matching algorithms, namely Longest Common Subsequence (LCS), Levenshtein Distance (LD), Jaro, Jaro-Winkler, Bi-gram, and TFIDF, on two datasets: the Enron corpus and the CSDMC2010 spam dataset. We observed that the Bi-gram algorithm performs best in spam detection on both datasets.
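To make the bi-gram idea concrete, here is a minimal sketch of one common formulation: the Dice coefficient over overlapping character bigrams. The abstract does not say which bi-gram variant, preprocessing, or decision rule the authors used, so the function names (`bigrams`, `bigram_similarity`, `looks_like_spam`), the lowercasing step, and the 0.6 threshold are all illustrative assumptions and not the paper's method.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count overlapping character bigrams in the lowercased text."""
    s = text.lower()  # assumption: case-insensitive comparison
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over bigram multisets: 2 * |A & B| / (|A| + |B|)."""
    ga, gb = bigrams(a), bigrams(b)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        # Both strings are shorter than two characters: fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    overlap = sum((ga & gb).values())  # multiset intersection
    return 2.0 * overlap / total


def looks_like_spam(message: str, known_spam: list[str],
                    threshold: float = 0.6) -> bool:
    """Hypothetical decision rule: flag a message that is close to any known spam."""
    return any(bigram_similarity(message, s) >= threshold for s in known_spam)


if __name__ == "__main__":
    spam_samples = ["win a free prize now!!!"]
    print(bigram_similarity("night", "nacht"))
    print(looks_like_spam("WIN a FREE prize now", spam_samples))
    print(looks_like_spam("meeting at noon", spam_samples))
```

A score near 1 means the two strings share most of their bigrams, so lightly edited copies of a known spam message still score high. In practice the threshold would have to be tuned on labeled data such as the corpora named above.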