Arabic Dialect Identification in Social Media

2020 3rd International Conference on Computer Applications & Information Security (ICCAIS). Pub Date: 2020-03-01. DOI: 10.1109/ICCAIS48893.2020.9096847

Reem AlYami, Rabeah Alzaidy


Citations: 4

Abstract



Purpose/Aim & Background: Although Arabic is spoken in twenty-two countries by more than 250 million speakers, Natural Language Processing (NLP) practitioners still consider it a low-resource language. Formal Arabic texts are typically written in Modern Standard (or Written) Arabic (MSA), the form used in formal writing and taught in schools to Arabic speakers. Informal communication among Arabic speakers, however, takes place in local dialects, which makes Arabic diglossic: speakers of the same language use varying dialects. Arabic has multiple dialects across different regions of the Arab world: Gulf, Levantine, and North African. On social media, users commonly write in their local dialect rather than in formal MSA. This introduces a core NLP problem for Arabic: dialect identification. Identifying the specific dialect is essential before tasks such as parsing and tokenizing, and before downstream tasks such as semantic inference. Processing massive amounts of data written in these local dialects requires this identification step to improve accuracy, especially for automatic text comprehension. Although Arabic dialects share a majority of their words, it is not uncommon for the same word to have different meanings across dialects. Beyond improving accuracy on NLP tasks, Arabic Dialect Identification (ADI) enables finer-grained demographic identification when mining texts such as consumer reports, health forums, and entertainment and tourism reviews, which can ultimately lead to improved services for each demographic.

The problem of ADI has been addressed by several studies, such as Al-Walaie & Khan (2017) and Harrat et al. (2019). Some works focus mainly on curating datasets for the problem, such as the Sham dataset proposed by Abu Kwaik et al. (2018).

In this work we address both tasks: we curate an Arabic dialect dataset for two variants of Arabic (Saudi Arabian and Egyptian), and we train supervised machine learning models for the identification task.
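
The abstract does not state which features or classifiers the authors trained, so the following is only an illustrative sketch of what a supervised two-dialect identifier can look like. It assumes a multinomial Naive Bayes model over character trigrams, which is a common baseline for dialect identification but is not confirmed as the paper's method. The class name `NgramNaiveBayes`, the helper `char_ngrams`, the label strings, and the four training phrases are all hypothetical; the phrases are made-up toy examples built from well-known dialect marker words, not samples from the authors' dataset.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, padding the text with one space per side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing.

    Hypothetical baseline for illustration; not the model from the paper.
    """

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram frequencies
        self.doc_counts = Counter()               # label -> number of training texts
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, float("-inf")
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in char_ngrams(text, self.n):
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for this sketch, not from the authors' dataset).
train_texts = [
    "وش تبغى الحين",         # Saudi-style phrasing
    "أبغى أروح الحين",       # Saudi-style phrasing
    "عايز إيه دلوقتي",       # Egyptian-style phrasing
    "أنا عايز أروح دلوقتي",  # Egyptian-style phrasing
]
train_labels = ["saudi", "saudi", "egyptian", "egyptian"]

model = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
print(model.predict("وش تبغى"))
print(model.predict("عايز إيه"))
```

Character n-grams are used here rather than whole words because social media text is short and inconsistently spelled, and sub-word fragments tend to capture dialect-specific markers even when tokenization is unreliable. A realistic system would need far more data and a held-out evaluation split.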

