Bochun: Automatically annotated stance detection dataset for Sorani Kurdish language

IF 1.4, JCR Q3, Multidisciplinary Sciences

Data in Brief. Pub Date: 2025-06-25. DOI: 10.1016/j.dib.2025.111839

Payman Sabr Rostam, Rebwar Mala Nabi

Data in Brief, Volume 61, Article 111839. Full text: https://www.sciencedirect.com/science/article/pii/S2352340925005669

Citations: 0

Abstract

Bochun: Automatically annotated stance detection dataset for Sorani Kurdish language

This research presents the first high-quality, automatically annotated stance detection dataset for the Sorani dialect of Kurdish, addressing the shortage of annotated resources for Kurdish, a low-resource language in Natural Language Processing (NLP). The dataset consists of 2,174 Kurdish news articles (1,410 economic and 764 political) originally published in 2024 and 2025, so the content is recent and topically relevant. The texts were selected from well-known Kurdish news agencies to preserve content validity and linguistic purity. Necessary preprocessing techniques were applied. Annotation was carried out in two steps. First, a pattern-recognition method with 2,456 phrases and keywords was applied to determine whether the subject of each text fell into the economics or politics category. Next, the stance of each article was annotated with an extended lexicon of 4,243 adjectives and verbs, categorized as support, oppose, or neutral. Where direct matches were not possible, semantic similarity and zero-shot classification were used as fallback measures. To verify the automatic annotation, a team of domain experts manually assessed a representative sample of the annotated texts, and a high inter-annotator agreement score confirmed the validity of the approach. The dataset is available in XLSX (Excel) format, which makes it easy to use across a variety of NLP research tasks. As an annotated and organized corpus, it is a solid starting point for researchers building Kurdish language processing models. The dataset is released publicly so that other researchers can build upon it and push the limits of NLP system performance on low-resource languages.
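
The two-step annotation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the keyword sets and lexicon below are tiny English placeholders standing in for the 2,456 Sorani phrases and the 4,243-entry stance lexicon, the function names are hypothetical, and treating ties as "no direct match" is an assumption. The semantic-similarity and zero-shot fallbacks are represented only by an optional caller-supplied function.

```python
# Hypothetical placeholder vocabularies; the real resources are in Sorani
# Kurdish and far larger (2,456 topic phrases, 4,243 stance words).
TOPIC_KEYWORDS = {
    "economic": {"market", "budget", "oil price", "inflation"},
    "political": {"parliament", "election", "minister", "coalition"},
}
STANCE_LEXICON = {
    "support": {"welcomed", "praised", "beneficial"},
    "oppose": {"rejected", "criticized", "harmful"},
    "neutral": {"announced", "stated", "reported"},
}


def count_hits(text, phrases):
    """Count phrase occurrences using simple substring matching.

    Substring matching is a simplification; a real system would need
    tokenization and normalization suited to Sorani script.
    """
    lowered = text.lower()
    return sum(lowered.count(phrase) for phrase in phrases)


def label_by_lexicon(text, lexicon, fallback=None):
    """Return the label whose vocabulary matches most often.

    If nothing matches, or the top score is tied (an assumption of this
    sketch), defer to `fallback` when given, otherwise return None.
    """
    scores = {label: count_hits(text, phrases) for label, phrases in lexicon.items()}
    best = max(scores.values())
    winners = [label for label, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return fallback(text) if fallback is not None else None
    return winners[0]


def annotate(text, stance_fallback=None):
    """Step 1: topic by keyword matching. Step 2: stance by lexicon lookup."""
    return {
        "topic": label_by_lexicon(text, TOPIC_KEYWORDS),
        "stance": label_by_lexicon(text, STANCE_LEXICON, fallback=stance_fallback),
    }


if __name__ == "__main__":
    print(annotate("The minister praised the election reform in parliament."))
    # A stand-in fallback; the paper uses semantic similarity and zero-shot models.
    print(annotate("Weather was mild today.", stance_fallback=lambda t: "neutral"))
```

Keeping the fallback as a pluggable function lets the lexicon lookup stay deterministic and cheap, with a heavier model consulted only for texts the lexicon cannot resolve.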


Source journal: Data in Brief (Multidisciplinary Sciences)

CiteScore: 3.10

Self-citation rate: 0.00%

Articles published: 996

Review time: 70 days

Journal description: Data in Brief provides a way for researchers to easily share and reuse each other's datasets by publishing data articles that:
- Thoroughly describe your data, facilitating reproducibility.
- Make your data, which is often buried in supplementary material, easier to find.
- Increase traffic towards associated research articles and data, leading to more citations.
- Open up doors for new collaborations.
Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.