A metric learning-based method for biomedical entity linking

Ngoc D. Le, Nhung T. H. Nguyen
Frontiers in Research Metrics and Analytics, published 2023-12-19. DOI: 10.3389/frma.2023.1247094 (https://doi.org/10.3389/frma.2023.1247094)

Abstract

Biomedical entity linking is the task of mapping mentions that occur in a particular textual context to a unique concept or entity in a knowledge base, e.g., the Unified Medical Language System (UMLS). One of the most challenging aspects of entity linking is the ambiguity of mentions, i.e., (1) mentions whose surface forms are very similar but which map to different entities in different contexts, and (2) entities that can be expressed using diverse types of mentions. Recent studies have used BERT-based encoders to encode mentions and entities into distinguishable representations such that their similarity can be measured using distance metrics. However, most real-world biomedical datasets suffer from severe imbalance, i.e., some classes have many instances while others appear only once or are completely absent from the training data. A common way to address this issue is to down-sample the dataset, i.e., to reduce the number of instances of the majority classes to make the dataset more balanced. In the context of entity linking, however, down-sampling reduces the model's ability to learn comprehensive representations of mentions across different contexts, which is essential for the task. To tackle this issue, we propose a metric-based learning method that treats a given entity and its mentions as a whole, regardless of the number of mentions in the training set. Specifically, our method uses a triplet loss-based function in conjunction with a clustering technique to learn the representations of mentions and entities. Through evaluations on two challenging biomedical datasets, i.e., MedMentions and BC5CDR, we show that our proposed method is able to address the issue of imbalanced data and to perform competitively with other state-of-the-art models. Moreover, our method significantly reduces computational cost in both the training and inference steps. Our source code is publicly available.
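To make the idea of a triplet loss over mention and entity representations concrete, the following is a minimal sketch and not the authors' implementation. It assumes plain Euclidean distance, a fixed margin, and a mean-of-mentions centroid as a stand-in for "treating an entity and its mentions as a whole"; the function names (`triplet_loss`, `centroid`, `link`) and the entity identifiers are hypothetical and chosen only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss.

    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin` (an assumed hyperparameter).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def centroid(vectors):
    """Mean vector of a group of mention embeddings.

    Assumption: a simple average stands in for the clustering step; the
    abstract does not specify which clustering technique is used.
    """
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def link(mention_vec, entity_vecs):
    """Return the id of the entity whose vector is nearest to the mention."""
    return min(entity_vecs, key=lambda eid: euclidean(mention_vec, entity_vecs[eid]))


# Toy 2-d embeddings; real encoders would produce high-dimensional vectors.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])  # negative is far away
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])  # negative is close

# Hypothetical entity ids, each represented by the centroid of its mentions.
entities = {
    "ENT_A": centroid([[1.0, 1.0], [1.2, 0.8]]),
    "ENT_B": centroid([[5.0, 5.0]]),
}
predicted = link([0.9, 1.1], entities)
```

In this toy setup the easy triplet contributes no loss, while the hard triplet (a negative only slightly farther than the positive) yields a positive loss that would push the two apart during training. Representing each entity by one aggregate vector also means inference compares a mention against one vector per entity rather than against every training mention, which is one plausible reading of the reduced inference cost claimed above.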