Workshop on Arabic Natural Language Processing最新文献

筛选
英文 中文
On The Arabic Dialects’ Identification: Overcoming Challenges of Geographical Similarities Between Arabic dialects and Imbalanced Datasets 阿拉伯方言识别:克服阿拉伯方言地理相似性和不平衡数据集的挑战
Workshop on Arabic Natural Language Processing Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.wanlp-1.49
Salma Jamal, Aly M. Kassem, Omar Mohamed, Ali Ashraf
{"title":"On The Arabic Dialects’ Identification: Overcoming Challenges of Geographical Similarities Between Arabic dialects and Imbalanced Datasets","authors":"Salma Jamal, Aly M. Kassem, Omar Mohamed, Ali Ashraf","doi":"10.18653/v1/2022.wanlp-1.49","DOIUrl":"https://doi.org/10.18653/v1/2022.wanlp-1.49","url":null,"abstract":"Arabic is one of the world’s richest languages, with a diverse range of dialects based on geographical origin. In this paper, we present a solution to tackle subtask 1 (Country-level dialect identification) of the Nuanced Arabic Dialect Identification (NADI) shared task 2022 achieving third place with an average macro F1 score between the two test sets of 26.44%. In the preprocessing stage, we removed the most common frequent terms from all sentences across all dialects, and in the modeling step, we employed a hybrid loss function approach that includes Weighted cross entropy loss and Vector Scaling(VS) Loss. On test sets A and B, our model achieved 35.68% and 17.192% Macro F1 scores, respectively.","PeriodicalId":355149,"journal":{"name":"Workshop on Arabic Natural Language Processing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124402351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Word Representation Models for Arabic Dialect Identification 阿拉伯语方言识别的词表示模型
Workshop on Arabic Natural Language Processing Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.wanlp-1.52
M. Sobhy, Ahmed H. Abu El-Atta, A. El-sawy, Hamada Nayel
{"title":"Word Representation Models for Arabic Dialect Identification","authors":"M. Sobhy, Ahmed H. Abu El-Atta, A. El-sawy, Hamada Nayel","doi":"10.18653/v1/2022.wanlp-1.52","DOIUrl":"https://doi.org/10.18653/v1/2022.wanlp-1.52","url":null,"abstract":"This paper describes the systems submitted by BFCAI team to Nuanced Arabic Dialect Identification (NADI) shared task 2022. Dialect identification task aims at detecting the source variant of a given text or speech segment automatically. There are two subtasks in NADI 2022, the first subtask for country-level identification and the second subtask for sentiment analysis. Our team participated in the first subtask. The proposed systems use Term Frequency Inverse/Document Frequency and word embeddings as vectorization models. Different machine learning algorithms have been used as classifiers. The proposed systems have been tested on two test sets: Test-A and Test-B. The proposed models achieved Macro-f1 score of 21.25% and 9.71% for Test-A and Test-B set respectively. On other hand, the best-performed submitted system achieved Macro-f1 score of 36.48% and 18.95% for Test-A and Test-B set respectively.","PeriodicalId":355149,"journal":{"name":"Workshop on Arabic Natural Language Processing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132346320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Arabic Dialect Identification and Sentiment Classification using Transformer-based Models 基于变换模型的阿拉伯语方言识别与情感分类
Workshop on Arabic Natural Language Processing Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.wanlp-1.54
Joseph Attieh, Fadi Hassan
{"title":"Arabic Dialect Identification and Sentiment Classification using Transformer-based Models","authors":"Joseph Attieh, Fadi Hassan","doi":"10.18653/v1/2022.wanlp-1.54","DOIUrl":"https://doi.org/10.18653/v1/2022.wanlp-1.54","url":null,"abstract":"In this paper, we present two deep learning approaches that are based on AraBERT, submitted to the Nuanced Arabic Dialect Identification (NADI) shared task of the Seventh Workshop for Arabic Natural Language Processing (WANLP 2022). NADI consists of two main sub-tasks, mainly country-level dialect and sentiment identification for dialectical Arabic. We present one system per sub-task. The first system is a multi-task learning model that consists of a shared AraBERT encoder with three task-specific classification layers. This model is trained to jointly learn the country-level dialect of the tweet as well as the region-level and area-level dialects. The second system is a distilled model of an ensemble of models trained using K-fold cross-validation. Each model in the ensemble consists of an AraBERT model and a classifier, fine-tuned on (K-1) folds of the training set. Our team Pythoneers achieved rank 6 on the first test set of the first sub-task, rank 9 on the second test set of the first sub-task, and rank 4 on the test set of the second sub-task.","PeriodicalId":355149,"journal":{"name":"Workshop on Arabic Natural Language Processing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123831388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
Learning From Arabic Corpora But Not Always From Arabic Speakers: A Case Study of the Arabic Wikipedia Editions 从阿拉伯语料库中学习,但并不总是从阿拉伯语使用者那里学习:阿拉伯语维基百科版本的案例研究
Workshop on Arabic Natural Language Processing Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.wanlp-1.34
Saied Alshahrani, Esma Wali, Jeanna Neefe Matthews
{"title":"Learning From Arabic Corpora But Not Always From Arabic Speakers: A Case Study of the Arabic Wikipedia Editions","authors":"Saied Alshahrani, Esma Wali, Jeanna Neefe Matthews","doi":"10.18653/v1/2022.wanlp-1.34","DOIUrl":"https://doi.org/10.18653/v1/2022.wanlp-1.34","url":null,"abstract":"Wikipedia is a common source of training data for Natural Language Processing (NLP) research, especially as a source for corpora in languages other than English. However, for many downstream NLP tasks, it is important to understand the degree to which these corpora reflect representative contributions of native speakers. In particular, many entries in a given language may be translated from other languages or produced through other automated mechanisms. Language models built using corpora like Wikipedia can embed history, culture, bias, stereotypes, politics, and more, but it is important to understand whose views are actually being represented. In this paper, we present a case study focusing specifically on differences among the Arabic Wikipedia editions (Modern Standard Arabic, Egyptian, and Moroccan). In particular, we document issues in the Egyptian Arabic Wikipedia with automatic creation/generation and translation of content pages from English without human supervision. These issues could substantially affect the performance and accuracy of Large Language Models (LLMs) trained from these corpora, producing models that lack the cultural richness and meaningful representation of native speakers. Fortunately, the metadata maintained by Wikipedia provides visibility into these issues, but unfortunately, this is not the case for all corpora used to train LLMs.","PeriodicalId":355149,"journal":{"name":"Workshop on Arabic Natural Language Processing","volume":"4 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127612819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
The Effect of Arabic Dialect Familiarity on Data Annotation 阿拉伯语方言熟悉度对数据标注的影响
Workshop on Arabic Natural Language Processing Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.wanlp-1.39
Ibrahim Abu Farha, Walid Magdy
{"title":"The Effect of Arabic Dialect Familiarity on Data Annotation","authors":"Ibrahim Abu Farha, Walid Magdy","doi":"10.18653/v1/2022.wanlp-1.39","DOIUrl":"https://doi.org/10.18653/v1/2022.wanlp-1.39","url":null,"abstract":"Data annotation is the foundation of most natural language processing (NLP) tasks. However, data annotation is complex and there is often no specific correct label, especially in subjective tasks. Data annotation is affected by the annotators’ ability to understand the provided data. In the case of Arabic, this is important due to the large dialectal variety. In this paper, we analyse how Arabic speakers understand other dialects in written text. Also, we analyse the effect of dialect familiarity on the quality of data annotation, focusing on Arabic sarcasm detection. This is done by collecting third-party labels and comparing them to high-quality first-party labels. Our analysis shows that annotators tend to better identify their own dialect and they are prone to confuse dialects they are unfamiliar with. For task labels, annotators tend to perform better on their dialect or dialects they are familiar with. Finally, females tend to perform better than males on the sarcasm detection task. We suggest that to guarantee high-quality labels, researchers should recruit native dialect speakers for annotation.","PeriodicalId":355149,"journal":{"name":"Workshop on Arabic Natural Language Processing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115568832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
Benchmarking transfer learning approaches for sentiment analysis of Arabic dialect 阿拉伯语方言情感分析的标杆迁移学习方法
Workshop on Arabic Natural Language Processing Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.wanlp-1.44
Emna Fsih, Saméh Kchaou, Rahma Boujelbane, Lamia Hadrich Belguith
{"title":"Benchmarking transfer learning approaches for sentiment analysis of Arabic dialect","authors":"Emna Fsih, Saméh Kchaou, Rahma Boujelbane, Lamia Hadrich Belguith","doi":"10.18653/v1/2022.wanlp-1.44","DOIUrl":"https://doi.org/10.18653/v1/2022.wanlp-1.44","url":null,"abstract":"Arabic has a widely varying collection of dialects. With the explosion of the use of social networks, the volume of written texts has remarkably increased. Most users express themselves using their own dialect. Unfortunately, many of these dialects remain under-studied due to the scarcity of resources. Researchers and industry practitioners are increasingly interested in analyzing users’ sentiments. In this context, several approaches have been proposed, namely: traditional machine learning, deep learning transfer learning and more recently few-shot learning approaches. In this work, we compare their efficiency as part of the NADI competition to develop a country-level sentiment analysis model. Three models were beneficial for this sub-task: The first based on Sentence Transformer (ST) and achieve 43.23% on DEV set and 42.33% on TEST set, the second based on CAMeLBERT and achieve 47.85% on DEV set and 41.72% on TEST set and the third based on multi-dialect BERT model and achieve 66.72% on DEV set and 39.69% on TEST set.","PeriodicalId":355149,"journal":{"name":"Workshop on Arabic Natural Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129605683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
SQU-CS @ NADI 2022: Dialectal Arabic Identification using One-vs-One Classification with TF-IDF Weights Computed on Character n-grams sq - cs @ NADI 2022:基于字符n-图计算TF-IDF权重的一对一分类阿拉伯方言识别
Workshop on Arabic Natural Language Processing Pub Date : 1900-01-01 DOI: 10.18653/v1/2022.wanlp-1.45
A. AAlAbdulsalam
{"title":"SQU-CS @ NADI 2022: Dialectal Arabic Identification using One-vs-One Classification with TF-IDF Weights Computed on Character n-grams","authors":"A. AAlAbdulsalam","doi":"10.18653/v1/2022.wanlp-1.45","DOIUrl":"https://doi.org/10.18653/v1/2022.wanlp-1.45","url":null,"abstract":"In this paper, I present an approach using one-vs-one classification scheme with TF-IDF term weighting on character n-grams for identifying Arabic dialects used in social media. The scheme was evaluated in the context of the third Nuanced Arabic Dialect Identification (NADI 2022) shared task for identifying Arabic dialects used in Twitter messages. The approach was implemented with logistic regression loss and trained using stochastic gradient decent (SGD) algorithm. This simple method achieved a macro F1 score of 22.89% and 10.83% on TEST A and TEST B, respectively, in comparison to an approach based on AraBERT pretrained transformer model which achieved a macro F1 score of 30.01% and 14.84%, respectively. My submission based on AraBERT scored a macro F1 average of 22.42% and was ranked 10 out of the 19 teams who participated in the task.","PeriodicalId":355149,"journal":{"name":"Workshop on Arabic Natural Language Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115297016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信