Mapping Chinese Medical Entities to the Unified Medical Language System.

Health Data Science. Pub Date: 2023-03-30; eCollection Date: 2023-01-01; DOI: 10.34133/hds.0011

Luming Chen, Yifan Qi, Aiping Wu, Lizong Deng, Taijiao Jiang

Abstract

Background: Chinese medical entities have not been organized comprehensively because well-developed terminology systems are lacking, which makes it difficult to process Chinese medical texts for fine-grained medical knowledge representation. To unify Chinese medical terminologies, mapping Chinese medical entities to their English counterparts in the Unified Medical Language System (UMLS) is an efficient solution. However, such mappings have not been sufficiently investigated in previous research. In this study, we explore strategies for mapping Chinese medical entities to the UMLS and systematically evaluate the mapping performance.

Methods: First, Chinese medical entities are translated to English using multiple web-based translation engines. Then, 3 mapping strategies are investigated: (a) string-based, (b) semantic-based, and (c) string and semantic similarity combined. In addition, cross-lingual pretrained language models are applied to map Chinese medical entities to UMLS concepts without translation. All of these strategies are evaluated on the ICD10-CN, Chinese Human Phenotype Ontology (CHPO), and RealWorld datasets.
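
Strategy (c) can be illustrated with a minimal sketch that scores each candidate concept as a weighted sum of a string similarity and a semantic similarity. The abstract does not give the tokenization, the weight, or the candidate set, so everything concrete here is an assumption: character trigram TF-IDF stands in for the string model, small hand-written vectors stand in for embeddings that a semantic model would produce, `alpha` is a hypothetical mixing weight, and the concept identifiers (`CUI_A`, etc.) are placeholders rather than real UMLS identifiers.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_idf(terms, n=3):
    """Compute smoothed inverse document frequencies over the candidate names."""
    total = len(terms)
    doc_freq = Counter(g for t in terms for g in set(char_ngrams(t, n)))
    return {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}


def tfidf(text, idf, n=3):
    """Sparse TF-IDF vector (dict); n-grams unseen during fitting are dropped."""
    counts = Counter(char_ngrams(text, n))
    return {g: tf * idf[g] for g, tf in counts.items() if g in idf}


def cosine_sparse(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cosine_dense(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(query, query_emb, candidates, alpha=0.5, top_k=5):
    """Rank (concept_id, name, embedding) candidates by a linear score combination.

    alpha weights the semantic score and (1 - alpha) the string score;
    the value 0.5 is an arbitrary assumption, not taken from the paper.
    """
    idf = fit_idf([name for _, name, _ in candidates])
    query_vec = tfidf(query, idf)
    scored = []
    for concept_id, name, emb in candidates:
        string_score = cosine_sparse(query_vec, tfidf(name, idf))
        semantic_score = cosine_dense(query_emb, emb)
        scored.append((alpha * semantic_score + (1 - alpha) * string_score, concept_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [concept_id for _, concept_id in scored[:top_k]]


# Toy data: the query is an already-translated entity; the embeddings are made up.
candidates = [
    ("CUI_A", "diabetes mellitus type 2", [0.9, 0.1, 0.0]),
    ("CUI_B", "hypertension", [0.0, 1.0, 0.0]),
    ("CUI_C", "diabetes insipidus", [0.5, 0.2, 0.8]),
]
ranking = rank_candidates("type 2 diabetes mellitus", [1.0, 0.0, 0.0], candidates)
print(ranking)
```

In a real pipeline the embeddings would come from an encoder rather than being written by hand, and the weight would normally be tuned on held-out data; neither detail is specified in the abstract.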

Results: The linear combination method based on the SapBERT and term frequency-inverse document frequency (TF-IDF) bag-of-words models performs best on all evaluation datasets, with top-5 accuracies of 91.85%, 82.44%, and 78.43% on the ICD10-CN, CHPO, and RealWorld datasets, respectively.
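
For readers unfamiliar with the metric, the following sketch shows how a top-k accuracy is commonly computed. It assumes the standard definition (a prediction counts as correct when the gold concept appears among the k highest-ranked candidates) and one gold concept per entity; the abstract does not spell out either point, and the labels below are toy placeholders.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k=5):
    """Fraction of entities whose gold concept is among the first k ranked candidates."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must have the same length")
    if not gold_labels:
        return 0.0
    hits = sum(
        1
        for predictions, gold in zip(ranked_predictions, gold_labels)
        if gold in predictions[:k]
    )
    return hits / len(gold_labels)


# Toy example: four entities, each with three ranked candidate concepts.
predictions = [["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"], ["J", "K", "L"]]
gold = ["A", "E", "Z", "L"]
print(top_k_accuracy(predictions, gold, k=1))  # 0.25
print(top_k_accuracy(predictions, gold, k=3))  # 0.75
```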

Conclusions: In our study, we explore strategies for mapping Chinese medical entities to the UMLS and identify a satisfactory linear combination method. Our investigation will facilitate Chinese medical entity normalization and inspire research that focuses on Chinese medical ontology development.
