Highly Accurate Spam Detection with the Help of Feature Selection and Data Transformation

Int. Arab J. Inf. Technol. Pub Date : 2023-01-01 DOI:10.34028/iajit/20/1/4

Hidayet Takçi, Nusrat Fatema


Citations: 3

Abstract

Spam volume is growing rapidly as email becomes more popular, which makes spam filtering necessary. To date, many knowledge-based, learning-based, and clustering-based methods have been developed for filtering spam emails. This study targets machine-learning-based spam detection using the C4.5, ID3, RndTree, C-Support Vector Classification (C-SVC), and Naïve Bayes algorithms. In addition, feature selection and data transformation methods were applied to improve detection performance. Experiments were performed on the UC Irvine Machine Learning Repository (UCI) spambase dataset, and the results were compared in terms of accuracy, Receiver Operating Characteristic (ROC) analysis, and classification speed. In the accuracy comparison, the C-SVC algorithm achieved the highest accuracy, 93.13%, followed by the RndTree algorithm. In the ROC analysis, the RndTree algorithm gave the best Area Under Curve (AUC) value, 0.999, while the C4.5 algorithm gave the second-best result. The fastest classifiers were the Naïve Bayes and RndTree algorithms. The experiments showed that feature selection and data transformation both improved spam detection. The transformation that improved classification most was the binary transformation, and the most effective feature selection method was forward selection.
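The two preprocessing steps the abstract singles out, binary transformation and forward selection, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the threshold of 0 used for binarization, the small Bernoulli Naïve Bayes stand-in classifier (the paper's best accuracy came from C-SVC), the toy data, and the shortcut of scoring on the training rows instead of a held-out set.

```python
import math

def binarize(rows, threshold=0.0):
    """Binary transformation (assumed form): 1 if a value exceeds threshold, else 0."""
    return [[1 if v > threshold else 0 for v in row] for row in rows]

def train_bernoulli_nb(X, y, features):
    """Fit a tiny Bernoulli Naive Bayes model with Laplace smoothing."""
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = math.log((len(rows) + 1) / (len(X) + 2))
        probs = {f: (sum(r[f] for r in rows) + 1) / (len(rows) + 2) for f in features}
        model[c] = (prior, probs)
    return model

def predict(model, x):
    """Return the class (0 = ham, 1 = spam) with the higher log-probability."""
    def score(c):
        prior, probs = model[c]
        return prior + sum(math.log(p if x[f] else 1 - p) for f, p in probs.items())
    return max((0, 1), key=score)

def accuracy(X, y, features):
    """Training-set accuracy; a simplification, real experiments need held-out data."""
    model = train_bernoulli_nb(X, y, features)
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)

def forward_selection(n_features, evaluate):
    """Greedily add the feature that most improves the score; stop when none does."""
    selected, best = [], float("-inf")
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        top_score = max(s for s, _ in scored)
        if top_score <= best:
            break  # no remaining feature improves the score
        best = top_score
        selected.append(next(f for s, f in scored if s == top_score))
    return selected, best

# Hypothetical toy data: three word-frequency columns, label 1 = spam, 0 = ham.
X_raw = [
    [0.5, 0.0, 1.2],
    [1.1, 0.3, 0.0],
    [0.7, 0.0, 0.4],
    [0.0, 0.2, 0.0],
    [0.0, 0.0, 0.9],
    [0.0, 0.4, 0.0],
]
y = [1, 1, 1, 0, 0, 0]

X_bin = binarize(X_raw)
selected, best_score = forward_selection(3, lambda feats: accuracy(X_bin, y, feats))
print(selected, best_score)
```

In this toy data only the first column separates the classes once binarized, so the greedy search keeps that column and stops. On the real spambase data the scoring function would be replaced by cross-validated accuracy of the chosen classifier.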

