An adaptive approach to noisy annotations in scientific information extraction

IF 7.4 | CAS Zone 1 (Management Science) | JCR Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS
Necva Bölücü, Maciej Rybinski, Xiang Dai, Stephen Wan
DOI: 10.1016/j.ipm.2024.103857
Journal: Information Processing & Management, Volume 61, Issue 6, Article 103857
Publication date: 2024-08-12
PDF: https://www.sciencedirect.com/science/article/pii/S0306457324002164/pdfft?md5=fff788405d49af01c42a5d5a7a592f76&pid=1-s2.0-S0306457324002164-main.pdf
Citations: 0

Abstract

An adaptive approach to noisy annotations in scientific information extraction

Despite recent advances in large language models (LLMs), the best effectiveness in information extraction (IE) is still achieved by fine-tuned models, hence the need for manually annotated datasets to train them. However, collecting human annotations for IE, especially for scientific IE, where expert annotators are often required, is expensive and time-consuming. Another issue widely discussed in the IE community is noisy annotations. Mislabelled training samples can hamper the effectiveness of trained models. In this paper, we propose a solution to alleviate problems originating from the high cost and difficulty of the annotation process. Our method distinguishes clean training samples from noisy samples and then employs weighted weakly supervised learning (WWSL) to leverage noisy annotations. Evaluation of Named Entity Recognition (NER) and Relation Classification (RC) tasks in Scientific IE demonstrates the substantial impact of detecting clean samples. Experimental results highlight that our method, utilising clean and noisy samples with WWSL, outperforms the baseline RoBERTa on NER (+4.28, +4.59, +29.27, and +5.21 gain for the ADE, SciERC, STEM-ECR, and WLPC datasets, respectively) and the RC (+6.09 and +4.39 gain for the SciERC and WLPC datasets, respectively) tasks. Comprehensive analyses of our method reveal its advantages over state-of-the-art denoising baseline models in scientific NER. Moreover, the framework is general enough to be adapted to different NLP tasks or domains, which means it could be useful in the broader NLP community.
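The abstract's core idea, separating clean from noisy training samples and then down-weighting, rather than discarding, the noisy ones in the training loss, can be illustrated with a minimal sketch. This is not the authors' WWSL implementation; the `noisy_weight` parameter, the per-sample triples, and the toy batch below are illustrative assumptions.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the gold label under a predicted distribution."""
    return -math.log(probs[label])

def weighted_loss(samples, noisy_weight=0.3):
    """Weighted average of per-sample losses.

    `samples` is a list of (probs, gold_label, is_clean) triples. Samples
    judged clean keep full weight 1.0; samples flagged as noisy still
    contribute, but only with weight `noisy_weight` (an assumed
    hyperparameter), so a likely mislabelled example cannot dominate
    the gradient signal.
    """
    total, weight_sum = 0.0, 0.0
    for probs, label, is_clean in samples:
        w = 1.0 if is_clean else noisy_weight
        total += w * cross_entropy(probs, label)
        weight_sum += w
    return total / weight_sum

# Toy batch for a binary classification step: two clean samples, one noisy.
batch = [
    ([0.9, 0.1], 0, True),   # confident and clean
    ([0.6, 0.4], 0, True),   # less confident, clean
    ([0.2, 0.8], 0, False),  # flagged noisy: the gold label looks wrong
]
print(round(weighted_loss(batch), 4))
```

Setting `noisy_weight=0.0` recovers the hard-filtering alternative (train on clean samples only), while `noisy_weight=1.0` treats all annotations as equally trustworthy; the weighted scheme sits between the two.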

Source journal
Information Processing & Management
Category: Engineering & Technology, Computer Science: Information Systems
CiteScore
17.00
Self-citation rate
11.60%
Articles published
276
Review time
39 days
About the journal: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.