An adaptive approach to noisy annotations in scientific information extraction

IF 7.4 · Zone 1 (Management) · JCR Q1: Computer Science, Information Systems

Information Processing & Management · Pub Date: 2024-08-12 · DOI: 10.1016/j.ipm.2024.103857

Necva Bölücü, Maciej Rybinski, Xiang Dai, Stephen Wan

Volume 61, Issue 6, Article 103857

Citations: 0

Abstract

Despite recent advances in large language models (LLMs), the best effectiveness in information extraction (IE) is still achieved by fine-tuned models, hence the need for manually annotated datasets to train them. However, collecting human annotations for IE is expensive and time-consuming, especially for scientific IE, where expert annotators are often required. Another issue widely discussed in the IE community is noisy annotations: mislabelled training samples can hamper the effectiveness of trained models. In this paper, we propose a solution to alleviate problems originating from the high cost and difficulty of the annotation process. Our method distinguishes clean training samples from noisy samples and then employs weighted weakly supervised learning (WWSL) to leverage noisy annotations. Evaluation on Named Entity Recognition (NER) and Relation Classification (RC) tasks in scientific IE demonstrates the substantial impact of detecting clean samples. Experimental results show that our method, utilising clean and noisy samples with WWSL, outperforms the baseline RoBERTa on NER (gains of +4.28, +4.59, +29.27, and +5.21 on the ADE, SciERC, STEM-ECR, and WLPC datasets, respectively) and on RC (gains of +6.09 and +4.39 on the SciERC and WLPC datasets, respectively). Comprehensive analyses of our method reveal its advantages over state-of-the-art denoising baseline models in scientific NER. Moreover, the framework is general enough to be adapted to different NLP tasks or domains, which means it could be useful in the broader NLP community.
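
The abstract does not say how clean samples are detected or how WWSL assigns weights, so the following is only a minimal sketch of the general idea of down-weighting samples flagged as noisy in a cross-entropy loss. The function names, the `is_clean` flags, and the `noisy_weight` value of 0.3 are all hypothetical and are not taken from the paper.

```python
import math

def sample_weights(is_clean, noisy_weight=0.3):
    """Give weight 1.0 to samples flagged clean and a smaller,
    hypothetical weight to samples flagged noisy."""
    return [1.0 if clean else noisy_weight for clean in is_clean]

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of per-sample negative log-likelihoods.

    probs[i] is the predicted class distribution for sample i,
    labels[i] is the (possibly noisy) annotated class index.
    """
    total = 0.0
    for p, y, w in zip(probs, labels, weights):
        total += w * -math.log(p[y])
    return total / sum(weights)

# Toy data: the third sample is flagged noisy and disagrees with the model.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]
is_clean = [True, True, False]

weights = sample_weights(is_clean)
loss = weighted_cross_entropy(probs, labels, weights)
unweighted = weighted_cross_entropy(probs, labels, [1.0] * len(labels))
print(f"weighted loss: {loss:.4f}, unweighted loss: {unweighted:.4f}")
```

In this toy case the noisy sample still contributes to the loss, but less than it would under uniform weighting, which is the intuition behind leveraging noisy annotations rather than discarding them.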


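Source journal

Information Processing & Management (Engineering & Technology; Computer Science: Information Systems)

CiteScore: 17.00

Self-citation rate: 11.60%

Articles published: 276

Review time: 39 days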

About the journal: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.