A Modified Naïve Bayesian-based Spam Filter using Support Vector Machine

2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), Pub Date: 2019-05-01, DOI: 10.1109/ICASERT.2019.8934629

Md. Sabir Hossain, M. Zubair, Mohammad Obaidur Rahman, Muhammad Kamrul Hossain Patwary, Md. Golam Sarwar Rajib


Abstract

Spam is an ever-growing problem that threatens current e-mail systems. Spam is unsolicited bulk e-mail, frequently financial in nature, and it creates the need for an anti-spam filter. Among the many spam filtering techniques, Naïve Bayesian filtering, described here as the most advanced method, has been implemented using a Support Vector Machine (SVM). Because spammers pay close attention to filtering techniques, dynamic filtering is needed, and the proposed method meets that demand. The algorithm splits a received e-mail into tokens and uses Bayes' theorem to calculate the spam probability of each token, from which it determines the total spam probability of the mail. Using an SVM instead of corpora is one of the added features of the algorithm. The most challenging part was feeding both individual words and whole sentences into the SVM as tokens and feature vectors. Including sentences in the training dataset increased the accuracy of detecting spam and ham. The Natural Language Toolkit (NLTK) is used to tokenize sentences and, to some extent, to capture the meaning of similar sentences. Because a test mail is compared word by word and sentence by sentence against the training datasets to decide whether it is spam, the performance of the filter improves. With some simple modifications, the filter can be used at both the server and the client end. Its efficiency increases gradually with the number of e-mails it processes.
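The token-level Bayesian step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it leaves out the SVM and sentence-level features, uses a simple regular-expression tokenizer in place of NLTK, and assumes Laplace smoothing and a tiny made-up training set. All function names and the `TRAIN` data are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (stand-in for NLTK)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(messages):
    """Count tokens and messages per class from (text, label) pairs.

    Labels are 'spam' or 'ham'; both classes must appear at least once.
    """
    counts = {"spam": Counter(), "ham": Counter()}
    docs = Counter()
    for text, label in messages:
        counts[label].update(tokenize(text))
        docs[label] += 1
    return counts, docs


def spam_probability(text, counts, docs):
    """Return P(spam | text) under the naive independence assumption."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    total_docs = docs["spam"] + docs["ham"]
    log_score = {}
    for label in ("spam", "ham"):
        total_tokens = sum(counts[label].values())
        # Start from the class prior P(label).
        score = math.log(docs[label] / total_docs)
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # assumption: tokens never seen in training are ignored
            # Laplace-smoothed estimate of P(token | label).
            likelihood = (counts[label][tok] + 1) / (total_tokens + len(vocab))
            score += math.log(likelihood)
        log_score[label] = score
    # Normalise the two log scores into a posterior probability.
    top = max(log_score.values())
    weights = {k: math.exp(v - top) for k, v in log_score.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])


# Hypothetical toy training data, for illustration only.
TRAIN = [
    ("win cash prize now", "spam"),
    ("cheap loan offer win money", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

if __name__ == "__main__":
    counts, docs = train(TRAIN)
    for mail in ("win a cash prize", "team meeting monday"):
        print(mail, "->", round(spam_probability(mail, counts, docs), 3))
```

Working in log space avoids numeric underflow when a mail contains many tokens. A threshold on the returned probability (for example 0.5, also an assumption) would then label the mail as spam or ham.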


