{"title":"N-gram和Word2Vec特征工程方法在沙特阿拉伯一些有影响力的Twitter话题上的垃圾邮件识别","authors":"Ahmed M. Balfagih, Vlado Keselj, Stacey Taylor","doi":"10.1145/3546157.3546173","DOIUrl":null,"url":null,"abstract":"Social media platforms, such as Twitter, have become powerful sources of information on people's perception of major events. Many people use Twitter to express their views on various issues and events and use it to develop their opinion on the diverse economic, political, technical, and social occurrences related to their daily lives. Spam and non-relevant tweets are a major challenge for Twitter trend detection. Saudi Arabia is a top ranked country in Twitter usage worldwide, and in recent years has experienced difficulties due to the use and rise of hashtags based on misleading tweets and spam. The goal of this paper is to apply machine learning techniques to identify spam on the Saudi tweets collected to the end of 2020. To date, spam detection on Twitter data has been mostly done in English, leaving other major languages, such as Arabic, insufficiently covered. Additionally, publicly accessible Arabic Twitter datasets are hard to find. For our research, we use eight Twitter datasets on some significant topics in politics, health, national affairs, economy, and sport, to train and evaluate different machine learning algorithms, with a focus on two feature generation techniques based on N-grams and Word2Vec embeddings. One contribution of this paper is providing these new labelled datasets with embeddings. The experimental results show improvement from using embeddings over N-grams in more balanced datasets vs. more unbalanced ones. We also find a superior performance of the Random Forest algorithm over other algorithms in most experiments.","PeriodicalId":422215,"journal":{"name":"Proceedings of the 6th International Conference on Information System and Data Mining","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"N-gram and Word2Vec Feature Engineering Approaches for Spam Recognition on Some Influential Twitter Topics in Saudi Arabia\",\"authors\":\"Ahmed M. Balfagih, Vlado Keselj, Stacey Taylor\",\"doi\":\"10.1145/3546157.3546173\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Social media platforms, such as Twitter, have become powerful sources of information on people's perception of major events. Many people use Twitter to express their views on various issues and events and use it to develop their opinion on the diverse economic, political, technical, and social occurrences related to their daily lives. Spam and non-relevant tweets are a major challenge for Twitter trend detection. Saudi Arabia is a top ranked country in Twitter usage worldwide, and in recent years has experienced difficulties due to the use and rise of hashtags based on misleading tweets and spam. The goal of this paper is to apply machine learning techniques to identify spam on the Saudi tweets collected to the end of 2020. To date, spam detection on Twitter data has been mostly done in English, leaving other major languages, such as Arabic, insufficiently covered. Additionally, publicly accessible Arabic Twitter datasets are hard to find. For our research, we use eight Twitter datasets on some significant topics in politics, health, national affairs, economy, and sport, to train and evaluate different machine learning algorithms, with a focus on two feature generation techniques based on N-grams and Word2Vec embeddings. One contribution of this paper is providing these new labelled datasets with embeddings. The experimental results show improvement from using embeddings over N-grams in more balanced datasets vs. more unbalanced ones. We also find a superior performance of the Random Forest algorithm over other algorithms in most experiments.\",\"PeriodicalId\":422215,\"journal\":{\"name\":\"Proceedings of the 6th International Conference on Information System and Data Mining\",\"volume\":\"34 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-05-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 6th International Conference on Information System and Data Mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3546157.3546173\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 6th International Conference on Information System and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3546157.3546173","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
N-gram and Word2Vec Feature Engineering Approaches for Spam Recognition on Some Influential Twitter Topics in Saudi Arabia
Social media platforms, such as Twitter, have become powerful sources of information on people's perception of major events. Many people use Twitter to express their views on various issues and events and use it to develop their opinion on the diverse economic, political, technical, and social occurrences related to their daily lives. Spam and non-relevant tweets are a major challenge for Twitter trend detection. Saudi Arabia is a top ranked country in Twitter usage worldwide, and in recent years has experienced difficulties due to the use and rise of hashtags based on misleading tweets and spam. The goal of this paper is to apply machine learning techniques to identify spam on the Saudi tweets collected to the end of 2020. To date, spam detection on Twitter data has been mostly done in English, leaving other major languages, such as Arabic, insufficiently covered. Additionally, publicly accessible Arabic Twitter datasets are hard to find. For our research, we use eight Twitter datasets on some significant topics in politics, health, national affairs, economy, and sport, to train and evaluate different machine learning algorithms, with a focus on two feature generation techniques based on N-grams and Word2Vec embeddings. One contribution of this paper is providing these new labelled datasets with embeddings. The experimental results show improvement from using embeddings over N-grams in more balanced datasets vs. more unbalanced ones. We also find a superior performance of the Random Forest algorithm over other algorithms in most experiments.