Exploring the impact of training datasets on Turkish stance detection

IF 1.2 | Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Muhammed Said Zengin, Berk Utku Yenisey, Mucahid Kutlu
DOI: 10.55730/1300-0632.4043 (https://doi.org/10.55730/1300-0632.4043)
Journal: Turkish Journal of Electrical Engineering and Computer Sciences
Volume: 202
Publication date: 2023-11-30
Publication type: Journal Article
Indexed via: Semantic Scholar
Citations: 0

Abstract

Stance detection has garnered considerable attention from researchers due to its broad range of applications, including fact-checking and social computing. While state-of-the-art stance detection models are usually based on supervised machine learning methods, their effectiveness relies heavily on the quality of the training data. This dependence is especially pronounced in stance detection because the stance of a text is intimately tied to the target under consideration. While numerous datasets exist for stance detection, determining their suitability for a specific target can be challenging. In this work, we focus on Turkish stance detection and explore the impact of training data on model performance. In particular, we fine-tune a BERT model with various datasets and assess its performance when the test data matches or differs from the training data in terms of target and domain. In addition, given the scarcity of resources for Turkish stance detection, we investigate i) whether existing datasets in other languages can be used in a cross-lingual setup, and ii) the effectiveness of data augmentation with simple automatic labeling methods. To conduct our experiments, we also create new Turkish stance detection datasets for various targets in different domains. Our comprehensive experiments yield the following findings. 1) Using training data with multiple targets in the same domain yields high performance, as the model is able to learn more characteristics of expressing stance from the additional data. 2) The domain of the training data plays a crucial role in achieving high performance. 3) Automatically generated data enhances performance when combined with manually annotated data. 4) Training solely on Turkish data outperforms training on a combination of Turkish and English data. Overall, our study points out the importance of creating annotated Turkish datasets for different domains to achieve high performance in stance detection.
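The comparison the abstract describes (test data that is the same as, or different from, the training data in target and domain, plus a cross-lingual setup) can be pictured as a grid of train/test pairs. The following is only a minimal sketch of how such a grid might be enumerated; it is not the authors' code. The dataset names, targets, and domains are hypothetical placeholders, and the choice to evaluate only on Turkish test sets is an assumption based on the paper's focus on Turkish.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class StanceDataset:
    """A stance dataset described by its target, domain, and language."""
    name: str
    target: str
    domain: str
    language: str


def relation(train: StanceDataset, test: StanceDataset) -> str:
    """Describe how a training set relates to a test set."""
    if train.target == test.target and train.domain == test.domain:
        kind = "same-target"
    elif train.domain == test.domain:
        kind = "cross-target, same-domain"
    else:
        kind = "cross-domain"
    # Flag pairs whose training language differs from the test language.
    if train.language != test.language:
        kind = "cross-lingual, " + kind
    return kind


def evaluation_grid(datasets):
    """Pair every training set with every Turkish test set (assumption)."""
    grid = []
    for train, test in product(datasets, datasets):
        if test.language != "tr":
            continue  # assumed: evaluation is always on Turkish data
        grid.append((train.name, test.name, relation(train, test)))
    return grid


# Hypothetical datasets: names, targets, and domains are placeholders,
# not the datasets created or used in the paper.
DATASETS = [
    StanceDataset("tr_politics_a", "target_a", "politics", "tr"),
    StanceDataset("tr_politics_b", "target_b", "politics", "tr"),
    StanceDataset("tr_sports_c", "target_c", "sports", "tr"),
    StanceDataset("en_politics_d", "target_d", "politics", "en"),
]

if __name__ == "__main__":
    for train_name, test_name, kind in evaluation_grid(DATASETS):
        print(f"train={train_name:<14} test={test_name:<14} {kind}")
```

Each row of the grid would correspond to one fine-tuning run followed by an evaluation; grouping the resulting scores by the relation label is one way to compare same-target, cross-target, cross-domain, and cross-lingual training.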
Source journal
Turkish Journal of Electrical Engineering and Computer Sciences
Categories: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; ENGINEERING, ELECTRICAL & ELECTRONIC
CiteScore: 2.90
Self-citation rate: 9.10%
Articles published: 95
Review time: 6.9 months
About the journal: The Turkish Journal of Electrical Engineering & Computer Sciences is published electronically 6 times a year by the Scientific and Technological Research Council of Turkey (TÜBİTAK). It accepts English-language manuscripts in the areas of power and energy, environmental sustainability and energy efficiency, electronics, industry applications, control systems, information and systems, applied electromagnetics, communications, signal and image processing, tomographic image reconstruction, face recognition, biometrics, speech processing, video processing and analysis, object recognition, classification, feature extraction, parallel and distributed computing, cognitive systems, interaction, robotics, digital libraries and content, personalized healthcare, ICT for mobility, sensors, and artificial intelligence. Contributions are open to researchers of all nationalities.