Joint Learning for Biomedical NER and Entity Normalization: Encoding Schemes, Counterfactual Examples, and Zero-Shot Evaluation.

ACM-BCB ... ... : the ... ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM Conference on Bioinformatics, Computational Biology and Biomedicine
Pub Date: 2021-08-01
DOI: 10.1145/3459930.3469533

Jiho Noh, Ramakanth Kavuluru

Citations: 5

Abstract

Named entity recognition (NER) and normalization (EN) form an indispensable first step in many biomedical natural language processing applications. In biomedical information science, recognizing entities (e.g., genes, diseases, or drugs) and normalizing them to concepts in standard terminologies or thesauri (e.g., Entrez, ICD-10, or RxNorm) is crucial for identifying more informative relations among them that drive disease etiology, progression, and treatment. In this effort we pursue two high-level strategies to improve biomedical NER and EN. The first is to decouple standard entity encoding tags (e.g., "B-Drug" for the beginning of a drug) into type tags (e.g., "Drug") and positional tags (e.g., "B"). The second is to use additional counterfactual training examples to keep models from learning spurious correlations between surrounding context and normalized concepts in the training data. We conduct extensive experiments using the MedMentions dataset, the largest dataset of its kind for NER and EN in biomedicine. We find that the first strategy performs better in entity normalization than the standard coding scheme. The second strategy, a form of data augmentation, uniformly improves performance in span detection, typing, and normalization. The gains from counterfactual examples are more prominent when evaluating in zero-shot settings, that is, on concepts never encountered during training.
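The tag decoupling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `decouple_tags` and `recouple_tags`, the example sentence, and the choice to mark non-entity tokens as "O" in both sequences are assumptions made for illustration only.

```python
from typing import List, Tuple


def decouple_tags(tags: List[str]) -> Tuple[List[str], List[str]]:
    """Split joint tags such as "B-Drug" into positional and type sequences.

    Assumption (not from the paper): tokens outside any entity get "O"
    in both output sequences.
    """
    positions: List[str] = []
    types: List[str] = []
    for tag in tags:
        if tag == "O":
            positions.append("O")
            types.append("O")
        else:
            # Split on the first hyphen only, so type names that contain
            # hyphens are preserved intact.
            position, _, entity_type = tag.partition("-")
            positions.append(position)
            types.append(entity_type)
    return positions, types


def recouple_tags(positions: List[str], types: List[str]) -> List[str]:
    """Rebuild joint tags from the two decoupled sequences."""
    return [p if p == "O" else f"{p}-{t}" for p, t in zip(positions, types)]


# Hypothetical example sentence with a two-token drug mention.
tokens = ["Patients", "received", "acetylsalicylic", "acid", "daily"]
joint = ["O", "O", "B-Drug", "I-Drug", "O"]

positions, types = decouple_tags(joint)
print(positions)  # ['O', 'O', 'B', 'I', 'O']
print(types)      # ['O', 'O', 'Drug', 'Drug', 'O']
```

With the two sequences separated, a model could in principle predict span boundaries and entity types with different output heads; how the paper actually wires these predictions together is not stated in the abstract.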

Source journal

ACM-BCB ... ... : the ... ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM Conference on Bioinformatics, Computational Biology and Biomedicine

Self-citation rate

0.00%

Number of publications