用于文档级关系提取的基于置信度的自适应数据修订框架

IF 7.4 1区管理学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Processing & Management Pub Date : 2024-09-26 DOI:10.1016/j.ipm.2024.103909

Chao Jiang , Jinzhi Liao , Xiang Zhao , Daojian Zeng , Jianhua Dai

{"title":"用于文档级关系提取的基于置信度的自适应数据修订框架","authors":"Chao Jiang , Jinzhi Liao , Xiang Zhao , Daojian Zeng , Jianhua Dai","doi":"10.1016/j.ipm.2024.103909","DOIUrl":null,"url":null,"abstract":"<div><div>Noisy annotations have become a key issue limiting Document-level Relation Extraction (DocRE). Previous research explored the problem through manual re-annotation. However, the handcrafted strategy is of low efficiency, incurs high human costs and cannot be generalized to large-scale datasets. To address the problem, we construct a confidence-based Revision framework for DocRE (ReD), aiming to achieve high-quality automatic data revision. Specifically, we first introduce a denoising training module to recognize relational facts and prevent noisy annotations. Second, a confidence-based data revision module is equipped to perform adaptive data revision for long-tail distributed relational facts. After the data revision, we design an iterative training module to create a virtuous cycle, which transforms the revised data into useful training data to support further revision. By capitalizing on ReD, we propose ReD-DocRED, which consists of 101,873 revised annotated documents from DocRED. ReD-DocRED has introduced 57.1% new relational facts, and concurrently, models trained on ReD-DocRED have achieved significant improvements in F1 scores, ranging from 6.35 to 16.55. The experimental results demonstrate that ReD can achieve high-quality data revision and, to some extent, replace manual labeling.1</div></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":"62 1","pages":"Article 103909"},"PeriodicalIF":7.4000,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An adaptive confidence-based data revision framework for Document-level Relation Extraction\",\"authors\":\"Chao Jiang , Jinzhi Liao , Xiang Zhao , Daojian Zeng , Jianhua Dai\",\"doi\":\"10.1016/j.ipm.2024.103909\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Noisy annotations have become a key issue limiting Document-level Relation Extraction (DocRE). Previous research explored the problem through manual re-annotation. However, the handcrafted strategy is of low efficiency, incurs high human costs and cannot be generalized to large-scale datasets. To address the problem, we construct a confidence-based Revision framework for DocRE (ReD), aiming to achieve high-quality automatic data revision. Specifically, we first introduce a denoising training module to recognize relational facts and prevent noisy annotations. Second, a confidence-based data revision module is equipped to perform adaptive data revision for long-tail distributed relational facts. After the data revision, we design an iterative training module to create a virtuous cycle, which transforms the revised data into useful training data to support further revision. By capitalizing on ReD, we propose ReD-DocRED, which consists of 101,873 revised annotated documents from DocRED. ReD-DocRED has introduced 57.1% new relational facts, and concurrently, models trained on ReD-DocRED have achieved significant improvements in F1 scores, ranging from 6.35 to 16.55. The experimental results demonstrate that ReD can achieve high-quality data revision and, to some extent, replace manual labeling.1</div></div>\",\"PeriodicalId\":50365,\"journal\":{\"name\":\"Information Processing & Management\",\"volume\":\"62 1\",\"pages\":\"Article 103909\"},\"PeriodicalIF\":7.4000,\"publicationDate\":\"2024-09-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Processing & Management\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0306457324002681\",\"RegionNum\":1,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457324002681","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

嘈杂的注释已成为限制文档级关系提取（DocRE）的一个关键问题。以往的研究通过人工重新标注来解决这一问题。然而，这种手工策略效率低、人力成本高，而且无法推广到大规模数据集。为解决这一问题，我们构建了基于置信度的 DocRE 修订框架（ReD），旨在实现高质量的数据自动修订。具体来说，我们首先引入了一个去噪训练模块，以识别关系事实并防止出现噪声注释。其次，配备基于置信度的数据修订模块，对长尾分布式关系事实进行自适应数据修订。数据修订后，我们设计了一个迭代训练模块，以创建一个良性循环，将修订后的数据转化为有用的训练数据，以支持进一步的修订。通过利用 ReD，我们提出了 ReD-DocRED，它由来自 DocRED 的 101,873 份修订注释文档组成。ReD-DocRED 引入了 57.1% 的新关系事实，同时，在 ReD-DocRED 上训练的模型的 F1 分数也有了显著提高，从 6.35 到 16.55 不等。实验结果表明，ReD 可以实现高质量的数据修订，并在一定程度上取代人工标注1。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

An adaptive confidence-based data revision framework for Document-level Relation Extraction

Noisy annotations have become a key issue limiting Document-level Relation Extraction (DocRE). Previous research explored the problem through manual re-annotation. However, the handcrafted strategy is of low efficiency, incurs high human costs and cannot be generalized to large-scale datasets. To address the problem, we construct a confidence-based Revision framework for DocRE (ReD), aiming to achieve high-quality automatic data revision. Specifically, we first introduce a denoising training module to recognize relational facts and prevent noisy annotations. Second, a confidence-based data revision module is equipped to perform adaptive data revision for long-tail distributed relational facts. After the data revision, we design an iterative training module to create a virtuous cycle, which transforms the revised data into useful training data to support further revision. By capitalizing on ReD, we propose ReD-DocRED, which consists of 101,873 revised annotated documents from DocRED. ReD-DocRED has introduced 57.1% new relational facts, and concurrently, models trained on ReD-DocRED have achieved significant improvements in F1 scores, ranging from 6.35 to 16.55. The experimental results demonstrate that ReD can achieve high-quality data revision and, to some extent, replace manual labeling.¹

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Information Processing & Management 工程技术-计算机：信息系统

CiteScore

17.00

自引率

11.60%

发文量

276

审稿时长

39 days

期刊介绍： Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.