Q8VaxStance: Dataset Labeling System for Stance Detection towards Vaccines in Kuwaiti Dialect

Impact Factor: 4.4 | JCR Q2 (Computer Science, Artificial Intelligence)

Big Data and Cognitive Computing | Pub Date: 2023-09-15 | DOI: 10.3390/bdcc7030151

Hana Alostad, Shoug Dawiek, Hasan Davulcu



Abstract

The Kuwaiti dialect is a dialect of Arabic spoken in Kuwait; it differs significantly from standard Arabic and from the dialects of neighboring countries in the same region. Few research papers focusing on the Kuwaiti dialect have been published in the field of NLP. In this study, we created Kuwaiti dialect language resources using Q8VaxStance, a vaccine stance labeling system for a large dataset of tweets. The resulting dataset helps fill this gap and provides a valuable resource for researchers studying vaccine hesitancy in Kuwait. Furthermore, it contributes to the Arabic natural language processing field by providing a dataset for developing and evaluating machine learning models for stance detection in the Kuwaiti dialect. The proposed vaccine stance labeling system combines the benefits of weakly supervised learning and zero-shot learning; for this purpose, we ran 52 experiments on 42,815 unlabeled tweets extracted between December 2020 and July 2022. The results of the experiments show that using keyword detection in conjunction with zero-shot model labeling functions is significantly better than using only keyword detection labeling functions or only zero-shot model labeling functions. Furthermore, in terms of the total number of generated labels, using Arabic for both the labels and the prompt, or Arabic labels with an English prompt, generates more labels than using English for both the labels and the prompt, and the difference is statistically significant. The best Macro-F1 values in our experiments were obtained when using keyword and hashtag detection labeling functions in conjunction with zero-shot model labeling functions, specifically in experiments KHZSLF-EE4 and KHZSLF-EA1, which both reached 0.83. Experiment KHZSLF-EE4 was able to label 42,270 tweets, while experiment KHZSLF-EA1 was able to label 42,764 tweets. Finally, the average annotation agreement between the generated labels and human labels ranges between 0.61 and 0.64, which is considered a good level of agreement.
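
To make the idea of combining keyword labeling functions with zero-shot labeling functions more concrete, the following is a minimal Python sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the English keyword and hashtag lists, the function names, the stub that stands in for a real zero-shot model, and the simple majority vote used to aggregate votes (the abstract does not say how Q8VaxStance aggregates the outputs of its labeling functions). The last function shows how a Macro-F1 score can be computed against human labels.

```python
from collections import Counter
from typing import Callable, Sequence

# Label encoding (hypothetical): a labeling function may also abstain.
PRO, ANTI, ABSTAIN = 1, 0, -1

# Hypothetical English placeholders; a real system would use
# Kuwaiti-dialect keywords and hashtags.
PRO_KEYWORDS = ("#get_vaccinated", "vaccine is safe")
ANTI_KEYWORDS = ("#no_to_vaccine", "refuse the vaccine")

LabelingFunction = Callable[[str], int]


def lf_pro_keywords(tweet: str) -> int:
    """Vote PRO if any pro-vaccine keyword or hashtag occurs."""
    text = tweet.lower()
    return PRO if any(k in text for k in PRO_KEYWORDS) else ABSTAIN


def lf_anti_keywords(tweet: str) -> int:
    """Vote ANTI if any anti-vaccine keyword or hashtag occurs."""
    text = tweet.lower()
    return ANTI if any(k in text for k in ANTI_KEYWORDS) else ABSTAIN


def lf_zero_shot_stub(tweet: str) -> int:
    """Stand-in for a zero-shot model labeling function.

    A real labeling function would send the tweet, a prompt, and the
    candidate labels to a zero-shot classifier; this stub only mimics
    the interface so the sketch runs without a model.
    """
    text = tweet.lower()
    if "protect" in text:
        return PRO
    if "dangerous" in text:
        return ANTI
    return ABSTAIN


def majority_vote(tweet: str, lfs: Sequence[LabelingFunction]) -> int:
    """Aggregate non-abstaining votes; abstain on no votes or a tie."""
    votes = [v for v in (lf(tweet) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             classes: Sequence[int] = (PRO, ANTI)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


LFS = (lf_pro_keywords, lf_anti_keywords, lf_zero_shot_stub)

if __name__ == "__main__":
    for tweet in ("The vaccine is safe and will protect us",
                  "I refuse the vaccine, it is dangerous",
                  "Nice weather today"):
        print(tweet, "->", majority_vote(tweet, LFS))
```

In this sketch a tweet stays unlabeled when every labeling function abstains or when the votes tie. Under those assumptions, adding zero-shot labeling functions on top of keyword ones can increase the number of tweets that receive a label, which is consistent with the abstract reporting the number of labeled tweets per experiment.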
