MANorm: A Normalization Dictionary for Moroccan Arabic Dialect Written in Latin Script

Randa Zarnoufi, H. Jaafar, Walid Bachri, Mounia Abik
Workshop on Arabic Natural Language Processing, published 2022-06-18
DOI: 10.48550/arXiv.2206.09167 (https://doi.org/10.48550/arXiv.2206.09167)
Citations: 2

Abstract

User-generated text from social media is a main resource for many NLP tasks. This text, however, does not follow standard writing rules. Moreover, the use of dialects such as Moroccan Arabic in written communication further increases the complexity of NLP tasks. A dialect is a spoken language that has no standard orthography. Written dialect is based on the phonetic transliteration of spoken words, which leads users to improvise spellings as they write. As a result, the same word can appear in multiple transliterated forms, and these different transliterations need to be normalized to one canonical word form. To reach this goal, we exploit word embedding models trained on a corpus of YouTube comments. In addition, using a Moroccan Arabic dialect dictionary that provides the canonical forms, we have built a normalization dictionary that we refer to as MANorm. We conducted several experiments to demonstrate the effectiveness of MANorm, and they show its usefulness in dialect normalization. MANorm is freely available online.
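The general idea of pairing embedding neighbours with a dictionary of canonical forms can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the three-dimensional vectors are invented toy values standing in for embeddings trained on comments, the words and the canonical set are hypothetical examples, and the names `build_normalization_dict`, `normalize`, and the `threshold` value are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_normalization_dict(vectors, canonical_forms, threshold=0.9):
    """Map each non-canonical word to its most similar canonical form.

    A word is mapped only if its best similarity reaches `threshold`
    (a hypothetical cut-off chosen for this toy example).
    """
    mapping = {}
    for word, vec in vectors.items():
        if word in canonical_forms:
            continue
        best, best_sim = None, threshold
        for canon in canonical_forms:
            if canon not in vectors:
                continue
            sim = cosine(vec, vectors[canon])
            if sim >= best_sim:
                best, best_sim = canon, sim
        if best is not None:
            mapping[word] = best
    return mapping


def normalize(text, mapping):
    """Replace every known variant in a whitespace-tokenized text."""
    return " ".join(mapping.get(tok, tok) for tok in text.split())


# Toy vectors (invented values): variants of the same word are placed close together.
toy_vectors = {
    "bezzaf": [0.90, 0.10, 0.00],
    "bzaf": [0.88, 0.12, 0.02],
    "bzzaf": [0.91, 0.08, 0.01],
    "mezyan": [0.10, 0.90, 0.10],
    "mzyan": [0.12, 0.88, 0.09],
    "salam": [0.00, 0.10, 0.95],
}
# Hypothetical canonical forms, standing in for entries of a dialect dictionary.
canonical = {"bezzaf", "mezyan"}

norm_dict = build_normalization_dict(toy_vectors, canonical)
print(norm_dict)
print(normalize("salam bzaf mzyan", norm_dict))
```

In this sketch a word with no sufficiently similar canonical form (here `salam`) is left unchanged, which keeps the dictionary conservative; a real system would need to tune the similarity cut-off on held-out data.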