MedBot vs RealDoc: efficacy of large language modeling in physician-patient communication for rare diseases.

IF 4.6 · Zone 2 (Medicine) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Journal of the American Medical Informatics Association Pub Date : 2025-05-01 DOI:10.1093/jamia/ocaf034

Magdalena T Weber, Richard Noll, Alexandra Marchl, Carlo Facchinello, Achim Grünewaldt, Christian Hügel, Khader Musleh, Thomas O F Wagner, Holger Storf, Jannik Schaaf

Pages 775-783. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12012358/pdf/

Citations: 0

Abstract

Objectives: This study assesses the abilities of 2 large language models (LLMs), GPT-4 and BioMistral 7B, in responding to patient queries, particularly concerning rare diseases, and compares their performance with that of physicians.

Materials and methods: A total of 103 patient queries and corresponding physician answers were extracted from EXABO, a question-answering forum dedicated to rare respiratory diseases. The responses provided by physicians and generated by LLMs were rated on a Likert scale by a panel of 4 experts based on 4 key quality criteria for health communication: correctness, comprehensibility, relevance, and empathy.

Results: GPT-4 performed significantly better than both the physicians and BioMistral 7B. While the overall ranking considers GPT-4's responses to be mostly correct, comprehensible, relevant, and empathetic, the responses provided by BioMistral 7B were only partially correct and empathetic. The responses given by physicians rank in between. The experts concur that an LLM could lighten the load for physicians, but rigorous validation is considered essential to guarantee dependability and efficacy.

Discussion: Open-source models such as BioMistral 7B offer the advantage of privacy by running locally in health-care settings. GPT-4, on the other hand, demonstrates proficiency in communication and knowledge depth. However, challenges persist, including the management of response variability, the balancing of comprehensibility with medical accuracy, and the assurance of consistent performance across different languages.

Conclusion: The performance of GPT-4 underscores the potential of LLMs in facilitating physician-patient communication. However, it is imperative that these systems are handled with care: without the requisite validation procedures, erroneous responses have the potential to cause harm.




Source journal

Journal of the American Medical Informatics Association (Medicine; Computer Science, Interdisciplinary Applications)

CiteScore: 14.50

Self-citation rate: 7.80%

Articles published: 230

Review time: 3-8 weeks

Journal description: JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives, and reviews also help readers stay connected with the most important informatics developments in implementation, policy, and education.