Moroccan Data-Driven Spelling Normalization Using Character Neural Embedding

Ridouane Tachicart, Karim Bouzoubaa
Vietnam. J. Comput. Sci. Journal Article, published 2021-02-01.
DOI: 10.1142/s2196888821500044 (https://doi.org/10.1142/s2196888821500044)
Citations: 3

Abstract

With Web use on the rise in Morocco, the Internet has become an important source of information. On social media in particular, Moroccans communicate in several languages, leaving behind unstructured user-generated text (UGT) that presents several opportunities for Natural Language Processing (NLP). Among the languages found in this data, Moroccan Arabic (MA) stands out for its substantial content and several distinctive features. In this paper, we investigate online written text generated by Moroccan users on social media, with an emphasis on Moroccan Arabic. To this end, we follow several steps and use tools such as a language identification system to study this data in depth. The most interesting findings are the use of code-switching, the use of multiple scripts, and the low number of words in Moroccan UGT. Moreover, we use the investigated data to build a new Moroccan language resource: a lexicon of orthographic variants of Moroccan words, built with an unsupervised approach based on character neural embedding. This lexicon can be useful for several NLP tasks, such as spelling normalization.
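To make the idea of an orthographic variants lexicon concrete, the following is a minimal sketch and not the method used in the paper. It swaps in plain character bigram count vectors for learned character neural embeddings, and groups words whose cosine similarity reaches a threshold. The function names, the threshold value of 0.6, and the sample Latin-script spellings are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(word, n=2):
    """Return a bag of character n-grams, with boundary markers added."""
    padded = f"<{word}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def variant_lexicon(words, threshold=0.6):
    """Map each word to the other words whose similarity meets the threshold.

    The threshold is a hypothetical value; a real system would tune it on data.
    """
    vectors = {w: char_ngrams(w) for w in words}
    return {
        w: sorted(
            o for o in words
            if o != w and cosine(vectors[w], vectors[o]) >= threshold
        )
        for w in words
    }


# Illustrative (assumed) spellings, not taken from the paper's data.
sample_words = ["bzaf", "bzzaf", "bezzaf", "salam"]
lexicon = variant_lexicon(sample_words)
for word, variants in lexicon.items():
    print(word, "->", variants)
```

In a data-driven setting, the count vectors would be replaced by embeddings learned from the corpus itself, so that variants are grouped by learned similarity rather than by surface overlap alone. Such a lexicon can then serve spelling normalization by mapping every variant in a group to one chosen canonical form.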