Building an Ensemble of Transformer Models for Arabic Dialect Classification and Sentiment Analysis

Abdullah Salem Khered, Ingy Yasser Hassan Abdou Abdelhalim, R. Batista-Navarro
Published in: Workshop on Arabic Natural Language Processing
DOI: https://doi.org/10.18653/v1/2022.wanlp-1.53
Citations: 5

Abstract

In this paper, we describe the approaches we developed for the Nuanced Arabic Dialect Identification (NADI) 2022 shared task, which consists of two subtasks: the identification of country-level Arabic dialects and sentiment analysis. Our team, UniManc, developed approaches to the two subtasks which are underpinned by the same model: a pre-trained MARBERT language model. For Subtask 1, we applied undersampling to create versions of the training data with a balanced distribution across classes. For Subtask 2, we further trained the original MARBERT model for the masked language modelling objective using a NADI-provided dataset of unlabelled Arabic tweets. For each of the subtasks, a MARBERT model was fine-tuned for sequence classification, using different values for hyperparameters such as seed and learning rate. This resulted in multiple model variants, which formed the basis of an ensemble model for each subtask. Based on the official NADI evaluation, our ensemble model obtained a macro-F1-score of 26.863, ranking second overall in the first subtask. In the second subtask, our ensemble model also ranked second, obtaining a macro-F1-PN score (macro-averaged F1-score over the Positive and Negative classes) of 73.544.
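
The three ingredients named in the abstract (class-balancing by undersampling, combining several fine-tuned model variants into an ensemble, and scoring with macro-F1-PN) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. The abstract does not say how the ensemble combines its members, so majority voting is an assumption here. The function names and the label strings "pos", "neg" and "neut" are likewise hypothetical, and the per-model predictions are toy values standing in for the outputs of fine-tuned MARBERT variants.

```python
import random
from collections import Counter, defaultdict


def undersample(examples, labels, seed=0):
    """Randomly keep as many examples per class as the rarest class has."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    n = min(len(xs) for xs in by_class.values())
    balanced = [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, n)]
    rng.shuffle(balanced)
    return balanced


def majority_vote(predictions_per_model):
    """Combine per-model label lists by majority vote.

    Assumption: the abstract does not specify the combination rule.
    Ties go to the label that appears first among the models' votes.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]


def f1_for_class(gold, pred, cls):
    """F1-score of a single class, treating it as the positive label."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1_pn(gold, pred, pos="pos", neg="neg"):
    """Macro-averaged F1 over the Positive and Negative classes only."""
    return (f1_for_class(gold, pred, pos) + f1_for_class(gold, pred, neg)) / 2


if __name__ == "__main__":
    # Toy predictions from three hypothetical model variants.
    gold = ["pos", "neg", "neut", "pos"]
    variants = [
        ["pos", "neg", "pos", "neg"],
        ["pos", "pos", "neut", "pos"],
        ["neg", "neg", "neut", "pos"],
    ]
    ensemble = majority_vote(variants)
    print(ensemble, macro_f1_pn(gold, ensemble))
```

Running `undersample` with different seeds yields different balanced versions of the same training set, which matches the abstract's mention of creating several balanced versions of the data. The neutral class still influences macro-F1-PN indirectly, because a neutral example mislabelled as positive or negative counts as a false positive for that class.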