Enhancing Spam Email Classification Using Effective Preprocessing Strategies and Optimal Machine Learning Algorithms

Pramod P. Ghogare, Husain H. Dawoodi, Manoj P. Patil
Indian Journal of Science and Technology. Published 2024-04-16. DOI: 10.17485/ijst/v17i15.2979 (https://doi.org/10.17485/ijst/v17i15.2979)

Abstract

Objective: This article proposes a content-based approach to spam email classification that applies various text pre-processing techniques. NLP techniques are used to pre-process the content of an email so that machine learning classifiers achieve the best possible spam classification performance.

Method: Several combinations of pre-processing methods, such as stop-word removal, tag removal, conversion to lower case, punctuation removal, special-character removal, and natural language processing, were applied to the content extracted from each email. Machine learning algorithms, namely Naive Bayes (NB), Support Vector Machine (SVM), and Random Forest (RF), then classified each email as ham or spam. Two standard datasets, Enron and SpamAssassin, along with a personal email dataset collected from Yahoo Mail, were used to evaluate the performance of the models.

Findings: Applying stemming during pre-processing with the RF classifier yielded the best results, achieving 99.2% accuracy on the SpamAssassin dataset and 99.3% accuracy on the Enron dataset. Lemmatization followed closely with 99% accuracy. In real-world testing on the personal Yahoo email dataset, the proposed method reached 97.28% accuracy, compared with 89.82% for the email service provider's built-in classifier. Additionally, the study found that SVM performs accurately when stop words are retained.

Novelty: This article highlights the fine-tuning of pre-processing techniques. The focus is on removing tags and certain special characters while retaining those that improve spam email classification accuracy. Unlike prior works that primarily emphasize algorithmic approaches and pre-defined processing functions, this research examines the details of data preparation and shows its significant impact on spam email classifiers. These findings emphasize the crucial role of pre-processing and contribute to a more nuanced understanding of effective strategies for robust spam detection.
Keywords: Spam, Classification, Pre-processing, NLP, Machine Learning
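
The pre-processing steps named in the abstract (tag removal, lower-casing, punctuation removal, optional stop-word removal, and stemming) followed by a Naive Bayes classifier can be sketched roughly as below. This is a minimal illustration using only the Python standard library, not the authors' implementation: the names `STOP_WORDS`, `naive_stem`, `preprocess`, and `NaiveBayes` are hypothetical, the stop-word list and training messages are invented toy data, and the crude suffix stripper merely stands in for a real stemmer. The RF and SVM classifiers from the paper are not shown.

```python
import math
import re
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the paper does not list one).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "for", "you", "your"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def naive_stem(token):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, remove_stop_words=True, stem=True):
    """Strip tags, lower-case, drop punctuation, then optionally stop and stem."""
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML-style tags
    text = text.lower()                     # convert to lower case
    text = text.translate(PUNCT_TABLE)      # remove punctuation
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.word_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total_docs)
            for token in tokens:
                if token in self.vocab:     # skip words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy messages, for illustration only.
train = [
    ("<b>Win</b> a FREE prize now!!!", "spam"),
    ("Claim your free cash prize today", "spam"),
    ("Meeting agenda for the project review", "ham"),
    ("Please review the attached project report", "ham"),
]
model = NaiveBayes().fit([preprocess(t) for t, _ in train], [y for _, y in train])
print(model.predict(preprocess("Claim your free prize now")))
```

Toggling `remove_stop_words` and `stem` gives a simple way to compare pre-processing combinations of the kind the study evaluates, for example retaining stop words as the abstract reports for SVM.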