MANorm: A Normalization Dictionary for Moroccan Arabic Dialect Written in Latin Script

Workshop on Arabic Natural Language Processing. Pub Date: 2022-06-18. DOI: 10.48550/arXiv.2206.09167

Randa Zarnoufi, H. Jaafar, Walid Bachri, Mounia Abik


Citations: 2

Abstract



User-generated social media text is the main resource for many NLP tasks. This text, however, does not follow standard writing rules. Moreover, the use of dialects such as Moroccan Arabic in written communication further increases the complexity of NLP tasks. A dialect is a spoken language that has no standard orthography. Written dialect relies on phonetic transliteration of spoken words, which leads users to improvise spellings as they write. As a result, the same word can appear in multiple transliterated forms, and these forms must be normalized to one canonical word form. To reach this goal, we exploit word embedding models trained on a corpus of YouTube comments. In addition, using a Moroccan Arabic dialect dictionary that provides the canonical forms, we build a normalization dictionary that we refer to as MANorm. We conducted several experiments to evaluate MANorm, and they showed its usefulness for dialect normalization. MANorm is freely available online.
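To make the idea in the abstract concrete, here is a minimal Python sketch of how a normalization dictionary could be derived from word embeddings and a list of canonical forms. It is not the authors' pipeline: the toy vectors, the example spellings, the thresholds `min_cos` and `min_str`, and the extra string-similarity filter are all assumptions made for illustration only.

```python
import math
from difflib import SequenceMatcher


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_normalization_dict(embeddings, canonical_forms, min_cos=0.9, min_str=0.6):
    """Map each non-canonical word to its closest canonical form.

    A variant is accepted only if it is close to the canonical form both in
    embedding space (cosine) and in spelling (difflib ratio). Both thresholds
    are hypothetical values, not taken from the paper.
    """
    mapping = {}
    for word, vec in embeddings.items():
        if word in canonical_forms:
            continue
        best, best_score = None, min_cos
        for canon in canonical_forms:
            if canon not in embeddings:
                continue
            score = cosine(vec, embeddings[canon])
            spelling = SequenceMatcher(None, word, canon).ratio()
            if score >= best_score and spelling >= min_str:
                best, best_score = canon, score
        if best is not None:
            mapping[word] = best
    return mapping


def normalize(text, mapping):
    """Replace every known variant in a whitespace-tokenized text."""
    return " ".join(mapping.get(tok, tok) for tok in text.split())


# Toy 3-dimensional vectors invented for this sketch; a real model would be
# trained on a large corpus such as the YouTube comments mentioned above.
TOY_EMBEDDINGS = {
    "bzaf": [0.90, 0.10, 0.00],
    "bzzaf": [0.88, 0.12, 0.01],
    "bezaf": [0.91, 0.09, 0.02],
    "wach": [0.05, 0.95, 0.10],
    "wash": [0.06, 0.93, 0.12],
    "video": [0.10, 0.10, 0.95],
}
# Canonical forms would come from a dialect dictionary; these two are assumed.
CANONICAL = {"bzaf", "wach"}

MANORM_SKETCH = build_normalization_dict(TOY_EMBEDDINGS, CANONICAL)
print(normalize("wash bzzaf video", MANORM_SKETCH))  # prints "wach bzaf video"
```

Combining an embedding threshold with a spelling threshold is one possible design choice: embeddings alone may group words that are merely related in meaning, while the spelling filter keeps only candidates that plausibly transliterate the same spoken word.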
