Generate a list of Stop Words in Moroccan Dialect from Social Network Data Using Word Embedding
Zineb Nassr, N. Sael, F. Benabbou
2021 International Conference on Digital Age & Technological Advances for Sustainable Development (ICDATA), 2021-06-01
DOI: 10.1109/ICDATA52997.2021.00022 (https://doi.org/10.1109/ICDATA52997.2021.00022)
Citations: 1
Abstract
Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that helps computers understand, interpret, and manipulate human language. NLP draws on many disciplines, including computer science and computational linguistics, in its effort to bridge the gap between human communication and computer understanding. In the information age, using NLP to optimize information search, text summarization, and text and data analysis systems has become increasingly important. To achieve good accuracy, redundant words with little or no semantic meaning must be filtered out. These words are known as stop words. Stop word lists have been developed for languages such as Arabic, English, Chinese, and French, but a standard stop word list is missing for dialects such as the Moroccan dialect. Manually identifying stop words for the Moroccan dialect is a difficult task, especially given the many different ways in which even a simple stop word can be written. In this work, we propose a novel method for generating Moroccan dialect stop words. To achieve this objective, we first apply preprocessing steps to reduce noise, then create a stop word dictionary to enrich our database for training purposes, and finally use word embeddings to build stop word clusters. The list is generated from data collected on three popular social networks: Facebook, Twitter, and YouTube.
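The three steps named in the abstract (noise-reducing preprocessing, a seed stop word dictionary, and embedding-based clustering) can be illustrated with a minimal sketch. It is not the authors' implementation. It replaces trained word embeddings with simple positional co-occurrence count vectors, and it uses English placeholder sentences rather than Moroccan dialect data. The names `normalize`, `cooccurrence_vectors`, and `expand_stop_words`, the elongation rule, and the 0.5 similarity threshold are all hypothetical choices made for illustration.

```python
import re
from collections import Counter, defaultdict
from math import sqrt


def normalize(token):
    """Assumed preprocessing: lowercase and collapse runs of 3+ repeated characters."""
    return re.sub(r"(.)\1{2,}", r"\1", token.lower())


def cooccurrence_vectors(sentences, window=1):
    """Map each word to a Counter of (relative offset, neighbour) context features."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][(j - i, tokens[j])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expand_stop_words(sentences, seeds, threshold=0.5):
    """Group words whose context vectors resemble a seed stop word's vector."""
    vectors = cooccurrence_vectors(sentences)
    clusters = {}
    for seed in sorted(seeds):
        if seed not in vectors:
            continue
        clusters[seed] = sorted(
            w for w in vectors
            if w not in seeds and cosine(vectors[seed], vectors[w]) >= threshold
        )
    return clusters


# Placeholder corpus: "daaa"/"da" stands in for an informal spelling of "the".
raw = [
    "I like the movie", "i like daaa movie",
    "we watch the match", "we watch da match",
    "they like the song", "they watch da song",
]
sentences = [[normalize(t) for t in line.split()] for line in raw]
clusters = expand_stop_words(sentences, seeds={"the"})
print(clusters)  # {'the': ['da']}
```

Here the variant `da` joins the cluster seeded by `the` because both appear in the same positions relative to the same neighbours (cosine similarity 0.875), while content words share no positional context with the seed. A real system would presumably train embeddings on a much larger corpus, where the threshold and window size would need tuning.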