A Weak Supervised Transfer Learning Approach for Sentiment Analysis to the Kuwaiti Dialect

Workshop on Arabic Natural Language Processing. DOI: 10.18653/v1/2022.wanlp-1.15

Fatemah Husain, Hana Al-Ostad, Halima Omar

Citations: 2

Abstract

Developing sentiment analysis systems for Arabic is challenging because the available Arabic datasets are limited. Many Arabic dialects remain unstudied in Arabic sentiment analysis because recruiting annotators during dataset creation is difficult. This paper addresses the research gap in sentiment analysis for the Kuwaiti dialect by proposing a weakly supervised approach to building a large labeled dataset. Our dataset consists of over 16.6k tweets (7,905 negative, 7,902 positive, and 860 neutral) that span several themes and time frames to limit any bias that might affect its content. Agreement between our proposed system's labels and human-annotated labels is 93% in pairwise percent agreement and 0.87 in Cohen's kappa coefficient. Furthermore, we evaluate the dataset using multiple traditional machine learning classifiers and advanced deep learning language models. The ARBERT model achieves 89% accuracy on the test set.
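The two agreement measures named in the abstract can be illustrated with a minimal sketch. The label lists below are invented toy data, not drawn from the paper's dataset, and the function names `percent_agreement` and `cohens_kappa` are hypothetical helpers written for this example; the paper does not state how its figures were computed.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two label sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = percent_agreement(a, b)  # observed agreement
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: sum over labels of the product of marginal proportions.
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy data (assumption): ten tweets labeled by the system and by a human.
system_labels = ["pos", "pos", "neg", "neg", "neu", "pos", "neg", "neg", "pos", "neg"]
human_labels = ["pos", "pos", "neg", "neg", "neu", "neg", "neg", "neg", "pos", "neg"]

agreement = percent_agreement(system_labels, human_labels)
kappa = cohens_kappa(system_labels, human_labels)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.3f}")
```

Kappa is lower than raw agreement because it discounts the matches two annotators would produce by chance given their label frequencies, which is why the abstract reports both numbers.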
