Yahui Liu, Zhenghua Li, Chen Gong, Shilin Zhou, Min Zhang
Expert Systems with Applications, Volume 290, Article 128374. DOI: 10.1016/j.eswa.2025.128374. Published 2025-05-31. JCR Q1, Computer Science, Artificial Intelligence (Impact Factor 7.5).
Annotation error detection in painstakingly annotated data: Part-of-speech tagging as a case study
The annotation error detection (AED) task aims to automatically identify annotation errors in a dataset, which is crucial for ensuring the reliability and effectiveness of expert and intelligent systems across diverse applications. Most previous works employ either synthesized data or subsets of crowdsourced datasets. In contrast, this work focuses on detecting errors in painstakingly annotated data, using part-of-speech (POS) tagging as a case study. We construct a high-quality Chinese AED dataset, named CTB7E, by manually re-annotating the test set of CTB7. Among 81,578 tags, we identify approximately 1,700 erroneous tags, a 2.1% error rate. We apply Kullback-Leibler (KL) divergence to AED for the first time and propose two new metrics. We investigate a wide range of AED approaches on both CTB7E and a synthesized dataset, under both single-model and Monte Carlo dropout settings. The results and analyses reveal interesting insights. We will release our data and code at https://github.com/yahui19960717/POS_AED.git to facilitate further research and collaboration in this area.
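The abstract does not detail how KL divergence is applied to AED; the sketch below is one plausible, illustrative variant (not necessarily the authors' method): score each token by the KL divergence from the one-hot gold-tag distribution to the model's predicted tag distribution, which reduces to the negative log-probability of the annotated tag, and rank tokens so that annotations the model finds least likely surface first. All function names and the toy data are assumptions for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions over the tag set."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def rank_suspicious_tokens(pred_dists, gold_tags):
    """Rank tokens by KL(onehot(gold) || model prediction).

    For a one-hot gold distribution this reduces to -log p(gold tag),
    so high scores flag tokens whose annotation the model finds unlikely.
    """
    scores = []
    for i, (dist, gold) in enumerate(zip(pred_dists, gold_tags)):
        onehot = np.zeros_like(dist)
        onehot[gold] = 1.0
        scores.append((kl_divergence(onehot, dist), i))
    return sorted(scores, reverse=True)  # most suspicious first

# Toy example: 3 tokens, 3 POS tags; token 2's annotation looks wrong.
preds = np.array([[0.90, 0.05, 0.05],
                  [0.10, 0.80, 0.10],
                  [0.02, 0.03, 0.95]])
gold = [0, 1, 0]  # token 2 annotated with tag 0, but the model favors tag 2
ranking = rank_suspicious_tokens(preds, gold)
```

In a Monte Carlo dropout setting, `preds` would be replaced by the mean of several stochastic forward passes, letting the score also reflect model uncertainty.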
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.