Spam Email Detection using ID3 Algorithm and Hidden Markov Model

2018 Conference on Information and Communication Technology (CICT) Pub Date : 2018-10-01 DOI:10.1109/INFOCOMTECH.2018.8722378

V. Kumar, Monika, Parveen Kumar, Ambalika Sharma


Citations: 7

Abstract



Email is a common way to communicate over the Internet, but it is plagued by spam. Spam emails waste memory, money, time, and communication bandwidth, so they need to be identified and eliminated. This paper combines the ID3 algorithm, used to build decision trees, with a Hidden Markov Model, used to calculate the probabilities of events that may occur, to classify emails as spam or ham. The model labels an email by calculating its total probability from all previously classified words in the email, and then supervises the processed emails by building their decision trees. For this purpose, an Enron dataset of 5172 emails is used that contains 2086 spam and 2086 ham pre-classified emails. Experimental results on this dataset show an accuracy of 89% on the spam emails.
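The abstract does not give implementation details, so the following is only a minimal sketch of the standard ID3 splitting criterion (information gain), not the authors' code. The toy emails, the word features `free` and `meeting`, and the function names are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction from splitting on one feature; ID3 picks the feature that maximizes it."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy data: each row records whether a word occurs in an email.
emails = [
    {"free": True, "meeting": False},
    {"free": True, "meeting": False},
    {"free": False, "meeting": True},
    {"free": False, "meeting": False},
]
labels = ["spam", "spam", "ham", "ham"]

# Score each candidate word and select the one ID3 would split on first.
gains = {word: information_gain(emails, labels, word) for word in ("free", "meeting")}
best_word = max(gains, key=gains.get)
print(best_word, gains)
```

In this toy set the word `free` separates spam from ham perfectly, so it has the higher gain and would become the root of the tree; ID3 then repeats the same selection on each resulting subset.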
