Sentiment analysis dataset in Moroccan dialect: bridging the gap between Arabic and Latin scripted dialect

IF 1.8 · Zone 3 (Computer Science) · JCR Q3 (Computer Science, Interdisciplinary Applications)

Language Resources and Evaluation · Pub Date: 2024-09-11 · DOI: 10.1007/s10579-024-09764-6

Mouad Jbel, Mourad Jabrane, Imad Hafidi, Abdulmutallib Metrane



Abstract

Sentiment analysis, the automated process of determining the emotions or opinions expressed in text, has been explored extensively in natural language processing. Sentiment analysis of the Moroccan dialect, however, remains underrepresented, even though the dialect has a unique linguistic landscape in which multiple scripts coexist. Previous work primarily targeted dialects written in Arabic script. While these efforts provided valuable insights, they may not fully capture the complexity of Moroccan web content, which blends Arabic and Latin script. Our study therefore emphasizes the importance of extending sentiment analysis to the full spectrum of Moroccan linguistic diversity. Central to our research is the creation of the largest public dataset for Moroccan dialect sentiment analysis, covering Moroccan dialect written in both Arabic script and Latin characters. By assembling a diverse range of textual data, we constructed a dataset of 19,991 manually labeled texts in Moroccan dialect, together with publicly available lists of Moroccan dialect stop words, as a new contribution to Moroccan Arabic resources. We then conducted a comprehensive study of various machine-learning models to assess their suitability for our dataset. The highest accuracy, 98.42%, was attained with the DarijaBert-mix transfer-learning model; among the deep learning models we also examined, a CNN model reached an accuracy of 92%. Finally, to affirm the reliability of our dataset, we tested the CNN model on smaller publicly available Moroccan dialect datasets, with promising results that support our findings.
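The abstract does not say how Arabic-script and Latin-script texts are told apart, so the following is only an illustrative sketch of one simple way to do it, not the authors' method. The function name `dominant_script`, the 0.2 threshold for calling a text mixed, and the sample sentences are all assumptions made for this example; none of them come from the paper or its dataset.

```python
import unicodedata


def dominant_script(text: str, mixed_threshold: float = 0.2) -> str:
    """Classify a text as 'arabic', 'latin', 'mixed', or 'other'.

    Hypothetical helper: counts alphabetic characters by the script
    named in their Unicode character name. Digits and punctuation are
    ignored, so Latin-script dialect that uses digits as letters
    (for example "3" or "7") is still counted by its Latin letters.
    """
    arabic = 0
    latin = 0
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            arabic += 1
        elif name.startswith("LATIN"):
            latin += 1

    total = arabic + latin
    if total == 0:
        return "other"
    # Call the text mixed only when the minority script is a real share
    # of the letters (the threshold is an arbitrary assumption).
    if min(arabic, latin) / total >= mixed_threshold:
        return "mixed"
    return "arabic" if arabic > latin else "latin"


# Illustrative sentences, not taken from the dataset.
samples = [
    "هاد الفيلم زوين بزاف",
    "had lfilm zwin bzaf",
    "3jbni had lfilm",
    "زوين bzaf",
    "12345 !!!",
]
for sample in samples:
    print(dominant_script(sample), "|", sample)
```

A label like this could be used to route each text to script-specific preprocessing, such as a separate stop-word list per script, before training a classifier.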




Source journal

Language Resources and Evaluation (Engineering and Technology: Computer Science, Interdisciplinary Applications)

CiteScore: 6.50

Self-citation rate: 3.70%

Review time: >12 weeks

About the journal: Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications. Language resources include language data and descriptions in machine-readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain-specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use. Evaluation of language resources concerns assessing the state of the art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.