Necva Bölücü, Maciej Rybinski, Xiang Dai, Stephen Wan
{"title":"An adaptive approach to noisy annotations in scientific information extraction","authors":"Necva Bölücü, Maciej Rybinski, Xiang Dai, Stephen Wan","doi":"10.1016/j.ipm.2024.103857","DOIUrl":null,"url":null,"abstract":"<div><p>Despite recent advances in large language models (LLMs), the best effectiveness in information extraction (IE) is still achieved by fine-tuned models, hence the need for manually annotated datasets to train them. However, collecting human annotations for IE, especially for scientific IE, where expert annotators are often required, is expensive and time-consuming. Another issue widely discussed in the IE community is noisy annotations. Mislabelled training samples can hamper the effectiveness of trained models. In this paper, we propose a solution to alleviate problems originating from the high cost and difficulty of the annotation process. Our method distinguishes <em>clean</em> training samples from <em>noisy</em> samples and then employs weighted weakly supervised learning (WWSL) to leverage noisy annotations. Evaluation of Named Entity Recognition (NER) and Relation Classification (RC) tasks in Scientific IE demonstrates the substantial impact of detecting clean samples. Experimental results highlight that our method, utilising clean and noisy samples with WWSL, outperforms the baseline RoBERTa on NER (+4.28, +4.59, +29.27, and +5.21 gain for the ADE, SciERC, STEM-ECR, and WLPC datasets, respectively) and the RC (+6.09 and +4.39 gain for the SciERC and WLPC datasets, respectively) tasks. Comprehensive analyses of our method reveal its advantages over state-of-the-art denoising baseline models in scientific NER. Moreover, the framework is general enough to be adapted to different NLP tasks or domains, which means it could be useful in the broader NLP community.</p></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":"61 6","pages":"Article 103857"},"PeriodicalIF":7.4000,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0306457324002164/pdfft?md5=fff788405d49af01c42a5d5a7a592f76&pid=1-s2.0-S0306457324002164-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457324002164","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Despite recent advances in large language models (LLMs), the best effectiveness in information extraction (IE) is still achieved by fine-tuned models, hence the need for manually annotated datasets to train them. However, collecting human annotations for IE, especially for scientific IE, where expert annotators are often required, is expensive and time-consuming. Another issue widely discussed in the IE community is noisy annotations. Mislabelled training samples can hamper the effectiveness of trained models. In this paper, we propose a solution to alleviate problems originating from the high cost and difficulty of the annotation process. Our method distinguishes clean training samples from noisy samples and then employs weighted weakly supervised learning (WWSL) to leverage noisy annotations. Evaluation of Named Entity Recognition (NER) and Relation Classification (RC) tasks in Scientific IE demonstrates the substantial impact of detecting clean samples. Experimental results highlight that our method, utilising clean and noisy samples with WWSL, outperforms the baseline RoBERTa on NER (+4.28, +4.59, +29.27, and +5.21 gain for the ADE, SciERC, STEM-ECR, and WLPC datasets, respectively) and the RC (+6.09 and +4.39 gain for the SciERC and WLPC datasets, respectively) tasks. Comprehensive analyses of our method reveal its advantages over state-of-the-art denoising baseline models in scientific NER. Moreover, the framework is general enough to be adapted to different NLP tasks or domains, which means it could be useful in the broader NLP community.
期刊介绍:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.