N-gram and Word2Vec Feature Engineering Approaches for Spam Recognition on Some Influential Twitter Topics in Saudi Arabia

Proceedings of the 6th International Conference on Information System and Data Mining. Pub Date: 2022-05-27. DOI: 10.1145/3546157.3546173

Ahmed M. Balfagih, Vlado Keselj, Stacey Taylor


Citations: 3

Abstract

Social media platforms such as Twitter have become powerful sources of information on how people perceive major events. Many people use Twitter to express their views on various issues and events, and to form opinions on the diverse economic, political, technical, and social occurrences related to their daily lives. Spam and non-relevant tweets are a major challenge for Twitter trend detection. Saudi Arabia is a top-ranked country in Twitter usage worldwide, and in recent years it has experienced difficulties due to the rise of hashtags based on misleading tweets and spam. The goal of this paper is to apply machine learning techniques to identify spam in Saudi tweets collected up to the end of 2020. To date, spam detection on Twitter data has mostly been done in English, leaving other major languages, such as Arabic, insufficiently covered. Additionally, publicly accessible Arabic Twitter datasets are hard to find. For our research, we use eight Twitter datasets on significant topics in politics, health, national affairs, economy, and sport to train and evaluate different machine learning algorithms, with a focus on two feature generation techniques based on N-grams and Word2Vec embeddings. One contribution of this paper is providing these new labelled datasets with embeddings. The experimental results show that embeddings improve over N-grams on the more balanced datasets, compared with the more unbalanced ones. We also find that the Random Forest algorithm outperforms the other algorithms in most experiments.
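The two feature generation techniques named in the abstract can be contrasted with a minimal sketch. This is an illustration under stated assumptions, not the authors' pipeline: the function names, the English toy tweet, the vocabulary, and the two-dimensional embedding table are all hypothetical, and the table stands in for vectors that a trained Word2Vec model would supply. The sketch only shows how a tweet becomes a sparse bag-of-N-grams count vector versus a dense averaged-embedding vector; a classifier such as Random Forest would then be trained on either representation.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return the contiguous word n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(tokens, vocabulary, n_values=(1, 2)):
    """Count how often each vocabulary n-gram occurs in the tweet."""
    counts = Counter()
    for n in n_values:
        counts.update(word_ngrams(tokens, n))
    return [counts[gram] for gram in vocabulary]


def embedding_features(tokens, embeddings, dim):
    """Average the vectors of tokens found in the embedding table.

    Returns a zero vector when no token is known.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Hypothetical toy inputs (assumptions, not data from the paper).
tweet = "win free prize win".split()
vocabulary = ["win", "free", "prize", "win free", "free prize", "prize win"]
toy_embeddings = {"win": [1.0, 0.0], "free": [0.0, 1.0], "prize": [1.0, 1.0]}

ngram_vec = ngram_features(tweet, vocabulary)              # one count per n-gram
embed_vec = embedding_features(tweet, toy_embeddings, 2)   # one value per dimension

print(ngram_vec)
print(embed_vec)
```

The N-gram vector grows with the vocabulary and records exact surface matches, while the embedding vector has a fixed length regardless of vocabulary size. Which one works better on a given dataset is an empirical question of the kind the paper investigates.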
