N-gram and Word2Vec Feature Engineering Approaches for Spam Recognition on Some Influential Twitter Topics in Saudi Arabia

Ahmed M. Balfagih, Vlado Keselj, Stacey Taylor
Published in: Proceedings of the 6th International Conference on Information System and Data Mining, 2022-05-27
DOI: 10.1145/3546157.3546173 (https://doi.org/10.1145/3546157.3546173)
Citations: 3

Abstract

Social media platforms, such as Twitter, have become powerful sources of information on people's perception of major events. Many people use Twitter to express their views on various issues and events, and to form opinions on the diverse economic, political, technical, and social occurrences related to their daily lives. Spam and non-relevant tweets are a major challenge for Twitter trend detection. Saudi Arabia is a top-ranked country in Twitter usage worldwide, and in recent years it has experienced difficulties due to the rise of hashtags based on misleading tweets and spam. The goal of this paper is to apply machine learning techniques to identify spam in Saudi tweets collected up to the end of 2020. To date, spam detection on Twitter data has mostly been done in English, leaving other major languages, such as Arabic, insufficiently covered. Additionally, publicly accessible Arabic Twitter datasets are hard to find. For our research, we use eight Twitter datasets on significant topics in politics, health, national affairs, economy, and sport to train and evaluate different machine learning algorithms, with a focus on two feature generation techniques based on N-grams and Word2Vec embeddings. One contribution of this paper is providing these new labelled datasets with embeddings. The experimental results show a greater improvement from using embeddings over N-grams on the more balanced datasets than on the more unbalanced ones. We also find that the Random Forest algorithm outperforms the other algorithms in most experiments.
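The two feature generation techniques named in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' tokenization, N-gram range, or the way word vectors are combined per tweet, so everything below is an assumption for illustration: whitespace tokenization, word unigrams plus bigrams, and a tweet vector formed by averaging word vectors (a common way to use Word2Vec, not necessarily the one used in the paper). The function names and the tiny `TOY_VECTORS` table are hypothetical; a real pipeline would load trained Word2Vec embeddings and pass the features to a classifier such as Random Forest.

```python
from collections import Counter
from typing import Dict, List, Sequence


def word_ngrams(tokens: Sequence[str], n: int) -> List[str]:
    """Return all contiguous word n-grams of the token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text: str, max_n: int = 2) -> Counter:
    """Bag of word n-grams (sizes 1..max_n) with raw counts."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    feats: Counter = Counter()
    for n in range(1, max_n + 1):
        feats.update(word_ngrams(tokens, n))
    return feats


def mean_embedding(text: str, vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in text.lower().split() if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


# Hypothetical 2-dimensional vectors standing in for trained Word2Vec embeddings.
TOY_VECTORS = {"free": [1.0, 0.0], "followers": [0.5, 0.5], "match": [0.0, 1.0]}

if __name__ == "__main__":
    tweet = "free followers free"
    print(ngram_features(tweet))                       # sparse N-gram counts
    print(mean_embedding(tweet, TOY_VECTORS, dim=2))   # dense tweet vector
```

The N-gram representation is sparse and grows with the vocabulary, while the averaged embedding has a fixed, small dimension regardless of tweet length; the abstract compares how these two kinds of features behave on datasets with different class balance.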