使用域内适应性 BERT 模型和分类层对多语言症状实体进行识别和规范化。

IF 3.4 4区生物学 Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Database: The Journal of Biological Databases and Curation Pub Date : 2024-08-28 DOI:10.1093/database/baae087

Fernando Gallego, Francisco J Veredas

{"title":"使用域内适应性 BERT 模型和分类层对多语言症状实体进行识别和规范化。","authors":"Fernando Gallego, Francisco J Veredas","doi":"10.1093/database/baae087","DOIUrl":null,"url":null,"abstract":"Due to the scarcity of available annotations in the biomedical domain, clinical natural language processing poses a substantial challenge, especially when applied to low-resource languages. This paper presents our contributions for the detection and normalization of clinical entities corresponding to symptoms, signs, and findings present in multilingual clinical texts. For this purpose, the three subtasks proposed in the SympTEMIST shared task of the Biocreative VIII conference have been addressed. For Subtask 1-named entity recognition in a Spanish corpus-an approach focused on BERT-based model assemblies pretrained on a proprietary oncology corpus was followed. Subtasks 2 and 3 of SympTEMIST address named entity linking (NEL) in Spanish and multilingual corpora, respectively. Our approach to these subtasks followed a classification strategy that starts from a bi-encoder trained by contrastive learning, for which several SapBERT-like models are explored. To apply this NEL approach to different languages, we have trained these models by leveraging the knowledge base of domain-specific medical concepts in Spanish supplied by the organizers, which we have translated into the other languages of interest by using machine translation tools. The results obtained in the three subtasks establish a new state of the art. Thus, for Subtask 1 we obtain precision results of 0.804, F1-score of 0.748, and recall of 0.699. For Subtask 2, we obtain performance gains of up to 5.5% in top-1 accuracy when the trained bi-encoder is followed by a WNT-softmax classification layer that is initialized with the mean of the embeddings of a subset of SNOMED-CT terms. For Subtask 3, the differences are even more pronounced, and our multilingual bi-encoder outperforms the other models analyzed in all languages except Swedish when combined with a WNT-softmax classification layer. Thus, the improvements in top-1 accuracy over the best bi-encoder model alone are 13% for Portuguese and 13.26% for Swedish. Database URL: https://doi.org/10.1093/database/baae087.","PeriodicalId":10923,"journal":{"name":"Database: The Journal of Biological Databases and Curation","volume":"2024 ","pages":""},"PeriodicalIF":3.4000,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11352596/pdf/","citationCount":"0","resultStr":"{\"title\":\"Recognition and normalization of multilingual symptom entities using in-domain-adapted BERT models and classification layers.\",\"authors\":\"Fernando Gallego, Francisco J Veredas\",\"doi\":\"10.1093/database/baae087\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Due to the scarcity of available annotations in the biomedical domain, clinical natural language processing poses a substantial challenge, especially when applied to low-resource languages. This paper presents our contributions for the detection and normalization of clinical entities corresponding to symptoms, signs, and findings present in multilingual clinical texts. For this purpose, the three subtasks proposed in the SympTEMIST shared task of the Biocreative VIII conference have been addressed. For Subtask 1-named entity recognition in a Spanish corpus-an approach focused on BERT-based model assemblies pretrained on a proprietary oncology corpus was followed. Subtasks 2 and 3 of SympTEMIST address named entity linking (NEL) in Spanish and multilingual corpora, respectively. Our approach to these subtasks followed a classification strategy that starts from a bi-encoder trained by contrastive learning, for which several SapBERT-like models are explored. To apply this NEL approach to different languages, we have trained these models by leveraging the knowledge base of domain-specific medical concepts in Spanish supplied by the organizers, which we have translated into the other languages of interest by using machine translation tools. The results obtained in the three subtasks establish a new state of the art. Thus, for Subtask 1 we obtain precision results of 0.804, F1-score of 0.748, and recall of 0.699. For Subtask 2, we obtain performance gains of up to 5.5% in top-1 accuracy when the trained bi-encoder is followed by a WNT-softmax classification layer that is initialized with the mean of the embeddings of a subset of SNOMED-CT terms. For Subtask 3, the differences are even more pronounced, and our multilingual bi-encoder outperforms the other models analyzed in all languages except Swedish when combined with a WNT-softmax classification layer. Thus, the improvements in top-1 accuracy over the best bi-encoder model alone are 13% for Portuguese and 13.26% for Swedish. Database URL: https://doi.org/10.1093/database/baae087.\",\"PeriodicalId\":10923,\"journal\":{\"name\":\"Database: The Journal of Biological Databases and Curation\",\"volume\":\"2024 \",\"pages\":\"\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2024-08-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11352596/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Database: The Journal of Biological Databases and Curation\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1093/database/baae087\",\"RegionNum\":4,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Database: The Journal of Biological Databases and Curation","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/database/baae087","RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 0

摘要

由于生物医学领域可用注释的匮乏，临床自然语言处理面临着巨大的挑战，尤其是在应用于低资源语言时。本文介绍了我们在多语言临床文本中与症状、体征和检查结果相对应的临床实体的检测和规范化方面做出的贡献。为此，我们讨论了第八届生物创新大会 SympTEMIST 共享任务中提出的三个子任务。对于子任务 1--西班牙文语料库中的命名实体识别，采用的方法是基于 BERT 的模型集合，并在专有肿瘤学语料库中进行了预训练。SympTEMIST 的子任务 2 和 3 分别涉及西班牙语和多语言语料库中的命名实体连接 (NEL)。我们在这些子任务中采用的分类策略是从对比学习训练的双编码器开始的，并为此探索了几种类似于 SapBERT 的模型。为了将这种 NEL 方法应用于不同的语言，我们利用主办方提供的西班牙语特定领域医学概念知识库来训练这些模型，并使用机器翻译工具将其翻译成其他相关语言。在三个子任务中获得的结果确立了新的技术水平。因此，在子任务 1 中，我们获得了 0.804 的精确度、0.748 的 F1 分数和 0.699 的召回率。对于子任务 2，如果在训练好的双编码器之后再加上一个以 SNOMED-CT 术语子集嵌入平均值为初始化的 WNT-softmax 分类层，我们就能获得高达 5.5% 的 top-1 准确率。在子任务 3 中，差异更加明显，我们的多语言双编码器在与 WNT-softmax 分类层相结合时，除瑞典语外，在所有语言中的表现都优于其他分析模型。因此，与单独的最佳双编码器模型相比，葡萄牙语和瑞典语的top-1准确率分别提高了13%和13.26%。数据库网址：https://doi.org/10.1093/database/baae087.

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Recognition and normalization of multilingual symptom entities using in-domain-adapted BERT models and classification layers.

Due to the scarcity of available annotations in the biomedical domain, clinical natural language processing poses a substantial challenge, especially when applied to low-resource languages. This paper presents our contributions for the detection and normalization of clinical entities corresponding to symptoms, signs, and findings present in multilingual clinical texts. For this purpose, the three subtasks proposed in the SympTEMIST shared task of the Biocreative VIII conference have been addressed. For Subtask 1-named entity recognition in a Spanish corpus-an approach focused on BERT-based model assemblies pretrained on a proprietary oncology corpus was followed. Subtasks 2 and 3 of SympTEMIST address named entity linking (NEL) in Spanish and multilingual corpora, respectively. Our approach to these subtasks followed a classification strategy that starts from a bi-encoder trained by contrastive learning, for which several SapBERT-like models are explored. To apply this NEL approach to different languages, we have trained these models by leveraging the knowledge base of domain-specific medical concepts in Spanish supplied by the organizers, which we have translated into the other languages of interest by using machine translation tools. The results obtained in the three subtasks establish a new state of the art. Thus, for Subtask 1 we obtain precision results of 0.804, F1-score of 0.748, and recall of 0.699. For Subtask 2, we obtain performance gains of up to 5.5% in top-1 accuracy when the trained bi-encoder is followed by a WNT-softmax classification layer that is initialized with the mean of the embeddings of a subset of SNOMED-CT terms. For Subtask 3, the differences are even more pronounced, and our multilingual bi-encoder outperforms the other models analyzed in all languages except Swedish when combined with a WNT-softmax classification layer. Thus, the improvements in top-1 accuracy over the best bi-encoder model alone are 13% for Portuguese and 13.26% for Swedish. Database URL: https://doi.org/10.1093/database/baae087.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Database: The Journal of Biological Databases and Curation MATHEMATICAL & COMPUTATIONAL BIOLOGY-

CiteScore

9.00

自引率

3.40%

发文量

100

审稿时长

>12 weeks

期刊介绍： Huge volumes of primary data are archived in numerous open-access databases, and with new generation technologies becoming more common in laboratories, large datasets will become even more prevalent. The archiving, curation, analysis and interpretation of all of these data are a challenge. Database development and biocuration are at the forefront of the endeavor to make sense of this mounting deluge of data. Database: The Journal of Biological Databases and Curation provides an open access platform for the presentation of novel ideas in database research and biocuration, and aims to help strengthen the bridge between database developers, curators, and users.