Biomedical Information Integration via Adaptive Large Language Model Construction.

IF 6.7 2区医学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Journal of Biomedical and Health Informatics Pub Date : 2024-11-11 DOI:10.1109/JBHI.2024.3496495

Xingsi Xue, Mu-En Wu, Fazlullah Khan

{"title":"Biomedical Information Integration via Adaptive Large Language Model Construction.","authors":"Xingsi Xue, Mu-En Wu, Fazlullah Khan","doi":"10.1109/JBHI.2024.3496495","DOIUrl":null,"url":null,"abstract":"<p><p>Integrating diverse biomedical knowledge information is essential to enhance the accuracy and efficiency of medical diagnoses, facilitate personalized treatment plans, and ultimately improve patient outcomes. However, Biomedical Information Integration (BII) faces significant challenges due to variations in terminology and the complex structure of entity descriptions across different datasets. A critical step in BII is biomedical entity alignment, which involves accurately identifying and matching equivalent entities across diverse datasets to ensure seamless data integration. In recent years, Large Language Model (LLMs), such as Bidirectional Encoder Representations from Transformers (BERTs), have emerged as valuable tools for discerning heterogeneous biomedical data due to their deep contextual embeddings and bidirectionality. However, different LLMs capture various nuances and complexity levels within the biomedical data, and none of them can ensure their effectiveness in all heterogeneous entity matching tasks. To address this issue, we propose a novel Two-Stage LLM construction (TSLLM) framework to adaptively select and combine LLMs for Biomedical Information Integration (BII). First, a Multi-Objective Genetic Programming (MOGP) algorithm is proposed for generating versatile high-level LLMs, and then, a Single-Objective Genetic Algorithm (SOGA) employs a confidence-based strategy is presented to combine the built LLMs, which can further improve the discriminative power of distinguishing heterogeneous entities. The experiment utilizes OAEI's entity matching datasets, i.e., Benchmark and Conference, along with LargeBio, Disease and Phenotype datasets to test the performance of TSLLM. The experimental findings validate the efficiency of TSLLM in adaptively differentiating heterogeneous biomedical entities, which significantly outperforms the leading entity matching techniques.</p>","PeriodicalId":13073,"journal":{"name":"IEEE Journal of Biomedical and Health Informatics","volume":"PP ","pages":""},"PeriodicalIF":6.7000,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal of Biomedical and Health Informatics","FirstCategoryId":"5","ListUrlMain":"https://doi.org/10.1109/JBHI.2024.3496495","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Integrating diverse biomedical knowledge information is essential to enhance the accuracy and efficiency of medical diagnoses, facilitate personalized treatment plans, and ultimately improve patient outcomes. However, Biomedical Information Integration (BII) faces significant challenges due to variations in terminology and the complex structure of entity descriptions across different datasets. A critical step in BII is biomedical entity alignment, which involves accurately identifying and matching equivalent entities across diverse datasets to ensure seamless data integration. In recent years, Large Language Model (LLMs), such as Bidirectional Encoder Representations from Transformers (BERTs), have emerged as valuable tools for discerning heterogeneous biomedical data due to their deep contextual embeddings and bidirectionality. However, different LLMs capture various nuances and complexity levels within the biomedical data, and none of them can ensure their effectiveness in all heterogeneous entity matching tasks. To address this issue, we propose a novel Two-Stage LLM construction (TSLLM) framework to adaptively select and combine LLMs for Biomedical Information Integration (BII). First, a Multi-Objective Genetic Programming (MOGP) algorithm is proposed for generating versatile high-level LLMs, and then, a Single-Objective Genetic Algorithm (SOGA) employs a confidence-based strategy is presented to combine the built LLMs, which can further improve the discriminative power of distinguishing heterogeneous entities. The experiment utilizes OAEI's entity matching datasets, i.e., Benchmark and Conference, along with LargeBio, Disease and Phenotype datasets to test the performance of TSLLM. The experimental findings validate the efficiency of TSLLM in adaptively differentiating heterogeneous biomedical entities, which significantly outperforms the leading entity matching techniques.

查看原文本刊更多论文

通过自适应大型语言模型构建实现生物医学信息整合。

整合不同的生物医学知识信息对于提高医疗诊断的准确性和效率、促进个性化治疗计划以及最终改善患者预后至关重要。然而，由于术语的差异和不同数据集实体描述结构的复杂性，生物医学信息集成（BII）面临着巨大的挑战。生物医学实体对齐是生物医学信息集成的关键步骤，它涉及准确识别和匹配不同数据集中的等效实体，以确保数据的无缝集成。近年来，大语言模型（LLM），如来自变换器的双向编码器表征（BERT），因其深度上下文嵌入和双向性，已成为辨别异构生物医学数据的重要工具。然而，不同的 LLMs 能捕捉到生物医学数据中的各种细微差别和复杂程度，没有一种 LLMs 能确保其在所有异构实体匹配任务中的有效性。为了解决这个问题，我们提出了一种新颖的两阶段 LLM 构建（Two-Stage LLM construction，TSLLM）框架，用于自适应地选择和组合 LLM，以实现生物医学信息集成（BII）。首先，我们提出了一种多目标遗传编程（MOGP）算法，用于生成通用的高级 LLM；然后，我们提出了一种单目标遗传算法（SOGA），采用基于置信度的策略来组合所构建的 LLM，从而进一步提高区分异质实体的能力。实验利用 OAEI 的实体匹配数据集（即 Benchmark 和 Conference）以及 LargeBio、Disease 和 Phenotype 数据集来测试 TSLLM 的性能。实验结果验证了 TSLLM 在自适应区分异构生物医学实体方面的效率，其性能明显优于领先的实体匹配技术。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Journal of Biomedical and Health Informatics COMPUTER SCIENCE, INFORMATION SYSTEMS-COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

CiteScore

13.60

自引率

6.50%

发文量

1151

期刊介绍： IEEE Journal of Biomedical and Health Informatics publishes original papers presenting recent advances where information and communication technologies intersect with health, healthcare, life sciences, and biomedicine. Topics include acquisition, transmission, storage, retrieval, management, and analysis of biomedical and health information. The journal covers applications of information technologies in healthcare, patient monitoring, preventive care, early disease diagnosis, therapy discovery, and personalized treatment protocols. It explores electronic medical and health records, clinical information systems, decision support systems, medical and biological imaging informatics, wearable systems, body area/sensor networks, and more. Integration-related topics like interoperability, evidence-based medicine, and secure patient data are also addressed.