阿拉伯语垃圾推文分类:全面的机器学习方法

AI Pub Date : 2024-07-02 DOI:10.3390/ai5030052
W. Hantom, Atta Rahman
{"title":"阿拉伯语垃圾推文分类:全面的机器学习方法","authors":"W. Hantom, Atta Rahman","doi":"10.3390/ai5030052","DOIUrl":null,"url":null,"abstract":"Nowadays, one of the most common problems faced by Twitter (also known as X) users, including individuals as well as organizations, is dealing with spam tweets. The problem continues to proliferate due to the increasing popularity and number of users of social media platforms. Due to this overwhelming interest, spammers can post texts, images, and videos containing suspicious links that can be used to spread viruses, rumors, negative marketing, and sarcasm, and potentially hack the user’s information. Spam detection is among the hottest research areas in natural language processing (NLP) and cybersecurity. Several studies have been conducted in this regard, but they mainly focus on the English language. However, Arabic tweet spam detection still has a long way to go, especially emphasizing the diverse dialects other than modern standard Arabic (MSA), since, in the tweets, the standard dialect is seldom used. The situation demands an automated, robust, and efficient Arabic spam tweet detection approach. To address the issue, in this research, various machine learning and deep learning models have been investigated to detect spam tweets in Arabic, including Random Forest (RF), Support Vector Machine (SVM), Naive Bayes (NB) and Long-Short Term Memory (LSTM). In this regard, we have focused on the words as well as the meaning of the tweet text. Upon several experiments, the proposed models have produced promising results in contrast to the previous approaches for the same and diverse datasets. The results showed that the RF classifier achieved 96.78% and the LSTM classifier achieved 94.56%, followed by the SVM classifier that achieved 82% accuracy. Further, in terms of F1-score, there is an improvement of 21.38%, 19.16% and 5.2% using RF, LSTM and SVM classifiers compared to the schemes with same dataset.","PeriodicalId":503525,"journal":{"name":"AI","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Arabic Spam Tweets Classification: A Comprehensive Machine Learning Approach\",\"authors\":\"W. Hantom, Atta Rahman\",\"doi\":\"10.3390/ai5030052\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Nowadays, one of the most common problems faced by Twitter (also known as X) users, including individuals as well as organizations, is dealing with spam tweets. The problem continues to proliferate due to the increasing popularity and number of users of social media platforms. Due to this overwhelming interest, spammers can post texts, images, and videos containing suspicious links that can be used to spread viruses, rumors, negative marketing, and sarcasm, and potentially hack the user’s information. Spam detection is among the hottest research areas in natural language processing (NLP) and cybersecurity. Several studies have been conducted in this regard, but they mainly focus on the English language. However, Arabic tweet spam detection still has a long way to go, especially emphasizing the diverse dialects other than modern standard Arabic (MSA), since, in the tweets, the standard dialect is seldom used. The situation demands an automated, robust, and efficient Arabic spam tweet detection approach. To address the issue, in this research, various machine learning and deep learning models have been investigated to detect spam tweets in Arabic, including Random Forest (RF), Support Vector Machine (SVM), Naive Bayes (NB) and Long-Short Term Memory (LSTM). In this regard, we have focused on the words as well as the meaning of the tweet text. Upon several experiments, the proposed models have produced promising results in contrast to the previous approaches for the same and diverse datasets. The results showed that the RF classifier achieved 96.78% and the LSTM classifier achieved 94.56%, followed by the SVM classifier that achieved 82% accuracy. Further, in terms of F1-score, there is an improvement of 21.38%, 19.16% and 5.2% using RF, LSTM and SVM classifiers compared to the schemes with same dataset.\",\"PeriodicalId\":503525,\"journal\":{\"name\":\"AI\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"AI\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/ai5030052\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"AI","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/ai5030052","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

如今,Twitter(又称 X)用户(包括个人和组织)面临的最常见问题之一就是处理垃圾推文。由于社交媒体平台越来越受欢迎,用户数量也越来越多,这个问题不断扩散。由于人们对垃圾推文的兴趣与日俱增,垃圾推文发送者可以发布包含可疑链接的文本、图片和视频,这些链接可用于传播病毒、谣言、负面营销和讽刺,还有可能入侵用户的信息。垃圾邮件检测是自然语言处理(NLP)和网络安全领域最热门的研究领域之一。在这方面已经开展了多项研究,但主要集中在英语领域。然而,阿拉伯语推文垃圾邮件检测仍有很长的路要走,尤其是强调现代标准阿拉伯语(MSA)以外的各种方言,因为在推文中很少使用标准方言。这种情况需要一种自动、稳健、高效的阿拉伯语垃圾推文检测方法。为解决这一问题,本研究研究了各种机器学习和深度学习模型,以检测阿拉伯语垃圾推文,包括随机森林(RF)、支持向量机(SVM)、奈夫贝叶斯(NB)和长短期记忆(LSTM)。在这方面,我们的重点是推文文本的单词和含义。经过多次实验,在相同和不同的数据集上,与之前的方法相比,所提出的模型产生了很好的结果。结果显示,RF 分类器的准确率达到 96.78%,LSTM 分类器的准确率达到 94.56%,其次是 SVM 分类器,准确率达到 82%。此外,在 F1 分数方面,使用 RF、LSTM 和 SVM 分类器与使用相同数据集的方案相比,分别提高了 21.38%、19.16% 和 5.2%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Arabic Spam Tweets Classification: A Comprehensive Machine Learning Approach
Nowadays, one of the most common problems faced by Twitter (also known as X) users, including individuals as well as organizations, is dealing with spam tweets. The problem continues to proliferate due to the increasing popularity and number of users of social media platforms. Due to this overwhelming interest, spammers can post texts, images, and videos containing suspicious links that can be used to spread viruses, rumors, negative marketing, and sarcasm, and potentially hack the user’s information. Spam detection is among the hottest research areas in natural language processing (NLP) and cybersecurity. Several studies have been conducted in this regard, but they mainly focus on the English language. However, Arabic tweet spam detection still has a long way to go, especially emphasizing the diverse dialects other than modern standard Arabic (MSA), since, in the tweets, the standard dialect is seldom used. The situation demands an automated, robust, and efficient Arabic spam tweet detection approach. To address the issue, in this research, various machine learning and deep learning models have been investigated to detect spam tweets in Arabic, including Random Forest (RF), Support Vector Machine (SVM), Naive Bayes (NB) and Long-Short Term Memory (LSTM). In this regard, we have focused on the words as well as the meaning of the tweet text. Upon several experiments, the proposed models have produced promising results in contrast to the previous approaches for the same and diverse datasets. The results showed that the RF classifier achieved 96.78% and the LSTM classifier achieved 94.56%, followed by the SVM classifier that achieved 82% accuracy. Further, in terms of F1-score, there is an improvement of 21.38%, 19.16% and 5.2% using RF, LSTM and SVM classifiers compared to the schemes with same dataset.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
AI
AI
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信