Annotation error detection in painstakingly annotated data: Part-of-speech tagging as a case study

IF 7.5 | Region 1: Computer Science | JCR Q1: Computer Science, Artificial Intelligence
Yahui Liu, Zhenghua Li, Chen Gong, Shilin Zhou, Min Zhang
Expert Systems with Applications, volume 290, Article 128374. Published 2025-05-31. DOI: 10.1016/j.eswa.2025.128374
https://www.sciencedirect.com/science/article/pii/S0957417425019931
Citations: 0

Abstract

The annotation error detection (AED) task aims to automatically identify annotation errors in a dataset, which is crucial for ensuring the reliability and effectiveness of expert and intelligent systems across diverse applications. Most previous work employs either synthesized data or subsets of crowdsourced datasets. In contrast, this work focuses on detecting errors in painstakingly annotated data, using part-of-speech (POS) tagging as a case study. We construct a high-quality Chinese AED dataset, named CTB7E, by manually re-annotating the test set of CTB7. Among 81,578 tags, we identify approximately 1,700 erroneous tags, an error rate of 2.1%. We apply Kullback-Leibler (KL) divergence to AED for the first time and propose two new metrics. We investigate a wide range of AED approaches on both CTB7E and a synthesized dataset, under both single-model and Monte Carlo dropout settings. The results and analyses reveal interesting insights. We will release our data and code at https://github.com/yahui19960717/POS_AED.git to facilitate further research and collaboration in this area.
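
The abstract does not say how KL divergence is turned into an error score, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes each token has a gold tag and several predictive distributions (for example, from Monte Carlo dropout passes), averages those distributions, and scores the token by the KL divergence from the one-hot gold distribution to the averaged prediction. With a one-hot gold distribution this reduces to the negative log-probability of the gold tag. The function names, the three-tag tagset, and the probabilities are all hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    # Terms with p_i == 0 contribute nothing; q_i is floored to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_distribution(samples):
    """Average several predictive distributions (e.g. from MC dropout passes)."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def aed_scores(gold_tags, sampled_probs, tagset):
    """Score each token; a higher score means the gold tag fits the model less."""
    scores = []
    for gold, samples in zip(gold_tags, sampled_probs):
        one_hot = [1.0 if tag == gold else 0.0 for tag in tagset]
        scores.append(kl_divergence(one_hot, mean_distribution(samples)))
    return scores

# Hypothetical toy data: three tokens, two stochastic passes each.
tagset = ["NN", "VV", "AD"]
gold_tags = ["NN", "VV", "NN"]
sampled_probs = [
    [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10]],  # model agrees with gold NN
    [[0.10, 0.80, 0.10], [0.20, 0.70, 0.10]],  # model agrees with gold VV
    [[0.05, 0.90, 0.05], [0.10, 0.85, 0.05]],  # model prefers VV, gold is NN
]

scores = aed_scores(gold_tags, sampled_probs, tagset)
# Token indices ordered from most to least suspicious.
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

In this toy example the third token ranks first, because the model places most of its probability on VV while the gold tag is NN. In a single-model setting under the same assumptions, each token would simply carry one distribution instead of several.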
Source journal
Expert Systems with Applications
Category: Engineering Technology - Engineering: Electrical and Electronic
CiteScore: 13.80
Self-citation rate: 10.60%
Publication volume: 2045
Review time: 8.7 months
Journal description: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.