Deep Bidirectional Transformers for Arabic Dialect Identification

Amal Alghamdi, Areej Alshutayri, Basma Alharbi
Published in: Proceedings of the 6th International Conference on Future Networks & Distributed Systems
Publication date: 2022-12-15
DOI: 10.1145/3584202.3584243 (https://doi.org/10.1145/3584202.3584243)
Citations: 0

Abstract

The rising adoption of social media has led to the widespread dissemination of online textual data. Arabic is among the five most widely spoken languages worldwide, with about 360.2 million native speakers. Arabic text on social media is written in a range of dialects, such as Gulf, Iraqi, Egyptian, Levantine, and North African. Identifying the dialect used in a text is valuable for several natural language processing tasks, including machine translation, text generation, word correction, and information retrieval. Arabic dialect identification is a multiclass classification problem in which each class represents a dialect. In this study, we investigated the performance of two bidirectional deep learning models for Arabic dialect classification: MARBERT and ARBERT. We evaluated both models on two publicly available datasets: the Arabic Online Commentary dataset and the Social Media Arabic Dialect Corpus. Extensive experiments covered binary, three-way, and multi-way dialect classification. The results indicate that MARBERT consistently achieved higher F1-scores than ARBERT, which can be attributed to the significant differences between the two models, including their architectures, training mechanisms, and data sources.
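To make the F1 comparison concrete, the following is a minimal sketch of how a per-class and macro-averaged F1-score could be computed for a multi-way dialect classifier. The abstract does not say which F1 averaging was used, so macro averaging is an assumption here. The function names, the labels, and both prediction lists (`model_a`, `model_b`) are hypothetical toy data, not results from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1          # correct prediction for this class
        else:
            fp[pred] += 1          # predicted class gains a false positive
            fn[gold] += 1          # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Hypothetical toy data: three dialect classes, six sentences.
gold = ["Gulf", "Gulf", "Egyptian", "Egyptian", "Levantine", "Levantine"]
model_a = ["Gulf", "Gulf", "Egyptian", "Levantine", "Levantine", "Levantine"]
model_b = ["Gulf", "Egyptian", "Egyptian", "Levantine", "Levantine", "Gulf"]

print(f"model_a macro-F1: {macro_f1(gold, model_a):.3f}")  # 0.822
print(f"model_b macro-F1: {macro_f1(gold, model_b):.3f}")  # 0.500
```

Macro averaging weights every dialect equally regardless of how many examples it has, which matters when dialect classes are imbalanced. A micro or weighted average would be an equally plausible reading of "F1-score".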