Enhancing Spam Email Classification Using Effective Preprocessing Strategies and Optimal Machine Learning Algorithms

Indian Journal of Science and Technology. Pub Date: 2024-04-16. DOI: 10.17485/ijst/v17i15.2979

Pramod P. Ghogare, Husain H. Dawoodi, Manoj P. Patil



Abstract

Objective: This article proposes a content-based approach to spam email classification that applies various text pre-processing techniques. NLP techniques are used to pre-process the content of an email in order to obtain the best spam classification performance from machine learning models.

Method: Several combinations of pre-processing methods, such as stop-word removal, tag removal, conversion to lower case, punctuation removal, special-character removal, and natural language processing steps (stemming and lemmatization), were applied to the content extracted from each email. Machine learning algorithms, namely Naive Bayes (NB), support vector machine (SVM), and random forest (RF), were then used to classify each email as ham or spam. The standard Enron and SpamAssassin datasets, along with a personal email dataset collected from Yahoo Mail, were used to evaluate the models.

Findings: Applying stemming during pre-processing with the RF classifier yielded the best results, achieving 99.2% accuracy on the SpamAssassin dataset and 99.3% accuracy on the Enron dataset. Lemmatization followed closely with 99% accuracy. In real-world testing on the personal Yahoo email dataset, the proposed method reached 97.28% accuracy, compared with 89.82% for the email service provider's built-in classifier. Additionally, the study found that SVM performs accurately when stop words are retained.

Novelty: This article highlights the fine-tuning of pre-processing techniques. The focus is on removing tags and certain special characters while retaining those that improve spam classification accuracy. Unlike prior works that primarily emphasize algorithmic approaches and pre-defined processing functions, this research examines the details of data preparation and shows its significant impact on spam email classifiers. These findings emphasize the crucial role of pre-processing and contribute to a more nuanced understanding of effective strategies for robust spam detection.

Keywords: Spam, Classification, Pre-processing, NLP, Machine Learning
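
The Method paragraph describes toggling individual pre-processing steps and feeding the result to a classifier. The following is a minimal sketch of that idea using only the Python standard library; it is not the authors' implementation, whose tooling the abstract does not state. The stop-word list, the `naive_stem` helper (a crude stand-in for a real stemmer such as Porter's), the `TinyNaiveBayes` class, and the four toy messages are all hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (the paper's list is not given).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "you", "your"}


def naive_stem(token):
    """Crude suffix stripping; an assumed stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, *, strip_tags=True, lowercase=True, strip_punct=True,
               remove_stop_words=True, stem=True):
    """Apply the selected pre-processing steps and return a list of tokens."""
    if strip_tags:
        text = re.sub(r"<[^>]+>", " ", text)   # drop HTML-like tags
    if lowercase:
        text = text.lower()
    if strip_punct:
        text = re.sub(r"[^\w\s]", " ", text)   # drop non-word, non-space characters
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over token counts."""

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)
        self.token_counts = {label: Counter() for label in self.class_docs}
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)       # class prior
            for t in tokens:
                if t in self.vocab:                      # ignore unseen tokens
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training messages, invented for illustration only.
train = [
    ("Win a FREE prize now!!!", "spam"),
    ("<b>Claim</b> your free winnings", "spam"),
    ("Meeting moved to Monday", "ham"),
    ("Please review the attached report", "ham"),
]
model = TinyNaiveBayes().fit([preprocess(text) for text, _ in train],
                             [label for _, label in train])

print(preprocess("<p>Claim your FREE prizes now!!!</p>"))
print(model.predict(preprocess("Free prizes for you")))
```

Exposing each step as a keyword flag makes it straightforward to rerun the same classifier under different combinations, for example `preprocess(text, remove_stop_words=False)` to mirror the stop-words-retained setting the abstract reports for SVM. A real experiment would replace the toy data with the Enron or SpamAssassin corpora and the hand-rolled pieces with an established NLP and machine learning library.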


