A Novel Evaluation Framework for Medical LLMs: Combining Fuzzy Logic and MCDM for Medical Relation and Clinical Concept Extraction.

IF 3.5 · Zone 3 (Medicine) · JCR Q1, HEALTH CARE SCIENCES & SERVICES

Journal of Medical Systems Pub Date : 2024-08-31 DOI:10.1007/s10916-024-02090-y

A H Alamoodi, Omar Zughoul, Dianese David, Salem Garfan, Dragan Pamucar, O S Albahri, A S Albahri, Salman Yussof, Iman Mohamad Sharaf

Journal of Medical Systems, volume 48, issue 1, article 81. Publication type: Journal Article.

Citations: 0

Abstract

Artificial intelligence (AI) has become a crucial element of modern technology, especially in healthcare, as the continuous development of large language models (LLMs) shows; these models are used in many domains, including medicine. However, using LLMs in the medical domain requires an evaluation platform to determine their suitability and to guide future development. To that end, this study develops a comprehensive Multi-Criteria Decision Making (MCDM) approach designed specifically to evaluate medical LLMs. The success of AI, particularly LLMs, in healthcare depends on efficacy, safety, and ethical compliance, so a robust evaluation framework is essential for integrating them into medical contexts. The study proposes the Fuzzy-Weighted Zero-InConsistency (FWZIC) method, extended to the p, q-quasirung orthopair fuzzy set (p, q-QROFS), for weighting the evaluation criteria. This extension handles the uncertainties inherent in medical decision-making, and by incorporating fuzzy logic principles the approach accommodates the imprecise and multifaceted nature of real-world medical data and criteria. The Multi-Attributive Ideal-Real Comparative Analysis (MAIRCA) method is then used to assess the medical LLMs in the case study. The results show that the "Medical Relation Extraction" criterion, with its sub-levels, carried more weight (0.504) than "Clinical Concept Extraction" (0.495). Of the 6 alternatives evaluated, ($A_4$) "GatorTron S 10B" ranked 1st, whereas ($A_1$) "GatorTron 90B" ranked 6th. The implications of this study extend beyond academic discourse, directly impacting healthcare practices and patient outcomes. The proposed framework can help healthcare professionals make more informed decisions regarding the adoption and utilization of LLMs in medical settings.
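To make the ranking step concrete, the following is a minimal sketch of the standard MAIRCA procedure, not the authors' implementation. It assumes crisp criterion weights are already available (in the paper these come from the p, q-QROFS extension of FWZIC, which is not reproduced here). The function name `mairca_rank`, the three alternatives, and their scores are hypothetical illustration values; only the two weights (0.504 and 0.495) are taken from the abstract.

```python
def mairca_rank(matrix, weights, benefit):
    """Standard MAIRCA: return (gaps, order).

    matrix  : rows = alternatives, columns = criteria (raw scores)
    weights : one weight per criterion
    benefit : True if higher is better for that criterion, False for cost
    gaps    : total gap per alternative (smaller is better)
    order   : alternative indices sorted best-first
    """
    m, n = len(matrix), len(weights)
    p_alt = 1.0 / m  # equal a priori preference for every alternative
    gaps = [0.0] * m
    for j in range(n):
        column = [row[j] for row in matrix]
        lo, hi = min(column), max(column)
        t_p = p_alt * weights[j]  # theoretical (ideal) evaluation
        for i in range(m):
            if hi == lo:
                share = 1.0  # criterion does not discriminate
            elif benefit[j]:
                share = (column[i] - lo) / (hi - lo)
            else:
                share = (hi - column[i]) / (hi - lo)
            t_r = t_p * share       # real evaluation
            gaps[i] += t_p - t_r    # gap between ideal and real
    order = sorted(range(m), key=lambda i: gaps[i])
    return gaps, order


# Hypothetical scores for three made-up models on the two top-level criteria
# (relation extraction, concept extraction). NOT data from the paper.
scores = [
    [0.90, 0.80],
    [0.85, 0.95],
    [0.70, 0.75],
]
weights = [0.504, 0.495]  # criterion weights reported in the abstract
benefit = [True, True]    # assumption: both criteria are "higher is better"

gaps, order = mairca_rank(scores, weights, benefit)
print("total gaps:", gaps)
print("ranking (best first):", order)
```

The alternative with the smallest total gap sits closest to the ideal and ranks first. In the study the same idea is applied to six LLM alternatives across the sub-criteria of both extraction tasks.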


Source journal

Journal of Medical Systems (Medicine: Health Care)

CiteScore

11.60

Self-citation rate

1.90%

Articles published

Review time

4.8 months

About the journal: Journal of Medical Systems provides a forum for the presentation and discussion of the increasingly extensive applications of new systems techniques and methods in hospital, clinic, and physician's office administration; pathology, radiology, and pharmaceutical delivery systems; medical records storage and retrieval; and ancillary patient-support systems. The journal publishes informative articles, essays, and studies across the entire scale of medical systems, from large hospital programs to novel small-scale medical services. Education is an integral part of this amalgamation of sciences, and selected articles are published in this area. Since existing medical systems are constantly being modified to fit particular circumstances and to solve specific problems, the journal includes a special section devoted to status reports on current installations.