EduNER: a Chinese named entity recognition dataset for education research.

IF 4.5 | Zone 3, Computer Science | Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Neural Computing & Applications Pub Date : 2023-05-20 DOI:10.1007/s00521-023-08635-5

Xu Li, Chengkun Wei, Zhuoren Jiang, Wenlong Meng, Fan Ouyang, Zihui Zhang, Wenzhi Chen


Citations: 0

Abstract

A high-quality domain-oriented dataset is crucial for domain-specific named entity recognition (NER). In this study, we introduce a novel education-oriented Chinese NER dataset (EduNER). To provide representative and diverse training data, we collect data from multiple sources, including textbooks, academic papers, and education-related web pages. The collected documents span ten years (2012-2021). A team of domain experts is invited to define the education NER schema, and a group of trained annotators is hired to complete the annotation. A collaborative labeling platform is built to accelerate human annotation. The constructed EduNER dataset includes 16 entity types, 11k+ sentences, and 35,731 entities. We conduct a thorough statistical analysis of EduNER and summarize its distinctive characteristics by comparing it with eight open-domain or domain-specific NER datasets. Sixteen state-of-the-art models are further used for NER task validation. The experimental results can inform further exploration. To the best of our knowledge, EduNER is the first publicly available dataset for the NER task in the education domain, which may promote the development of education-oriented NER models.
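The abstract reports dataset-level statistics (entity types, sentences, entities). As a rough illustration of how such counts can be derived from an annotated NER corpus, the sketch below assumes character-level BIO tagging, which the abstract does not confirm as EduNER's format. The label names (`ORG`, `COURSE`), the toy sentences, and the helper names `extract_entities` and `dataset_stats` are hypothetical and are not taken from the EduNER schema or its release.

```python
from collections import Counter


def extract_entities(tokens, tags):
    """Collect (entity_text, entity_type) pairs from one BIO-tagged sentence."""
    entities = []
    current_chars, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the open entity.
            current_chars.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close the open entity and ignore the current token.
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [], None
    if current_chars:
        entities.append(("".join(current_chars), current_type))
    return entities


def dataset_stats(corpus):
    """Summarize a corpus given as a list of (tokens, tags) pairs."""
    per_type = Counter()
    total_entities = 0
    for tokens, tags in corpus:
        for _, entity_type in extract_entities(tokens, tags):
            per_type[entity_type] += 1
            total_entities += 1
    return {
        "sentences": len(corpus),
        "entities": total_entities,
        "entity_types": len(per_type),
        "per_type": dict(per_type),
        "entities_per_sentence": total_entities / len(corpus) if corpus else 0.0,
    }


# Toy corpus with hypothetical labels (not the EduNER schema).
toy_corpus = [
    (
        list("浙江大学开设机器学习课程"),
        ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O",
         "B-COURSE", "I-COURSE", "I-COURSE", "I-COURSE", "O", "O"],
    ),
    (
        list("教育部发布通知"),
        ["B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"],
    ),
]

print(dataset_stats(toy_corpus))
```

Counting at the entity level rather than the tag level keeps multi-character mentions such as 浙江大学 as a single entity, which is the granularity at which figures like "35,731 entities" are normally reported.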




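Source journal

Neural Computing & Applications (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 11.40
Self-citation rate: 8.30%
Articles published: 1280
Review time: 6.9 months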

Journal description: Neural Computing & Applications is an international journal that publishes original research and other information in the field of practical applications of neural computing and related techniques such as genetic algorithms, fuzzy logic and neuro-fuzzy systems. All items relevant to building practical systems are within its scope, including but not limited to: adaptive computing, algorithms, applicable neural networks theory, applied statistics, architectures, artificial intelligence, benchmarks, case histories of innovative applications, fuzzy logic, genetic algorithms, hardware implementations, hybrid intelligent systems, intelligent agents, intelligent control systems, intelligent diagnostics, intelligent forecasting, machine learning, neural networks, neuro-fuzzy systems, pattern recognition, performance measures, self-learning systems, software simulations, supervised and unsupervised learning methods, and system engineering and integration. Featured contributions fall into several categories: Original Articles, Review Articles, Book Reviews and Announcements.