Performance of large language models on advocating the management of meningitis: a comparative qualitative study

BMJ Health & Care Informatics · IF 4.1 · Q1 Health Care Sciences & Services
Urs Fisch, Paulina Kliem, Pascale Grzonka, Raoul Sutter
{"title":"Performance of large language models on advocating the management of meningitis: a comparative qualitative stud","authors":"Urs Fisch, Paulina Kliem, Pascale Grzonka, Raoul Sutter","doi":"10.1136/bmjhci-2023-100978","DOIUrl":null,"url":null,"abstract":"Objectives We aimed to examine the adherence of large language models (LLMs) to bacterial meningitis guidelines using a hypothetical medical case, highlighting their utility and limitations in healthcare. Methods A simulated clinical scenario of a patient with bacterial meningitis secondary to mastoiditis was presented in three independent sessions to seven publicly accessible LLMs (Bard, Bing, Claude-2, GTP-3.5, GTP-4, Llama, PaLM). Responses were evaluated for adherence to good clinical practice and two international meningitis guidelines. Results A central nervous system infection was identified in 90% of LLM sessions. All recommended imaging, while 81% suggested lumbar puncture. Blood cultures and specific mastoiditis work-up were proposed in only 62% and 38% sessions, respectively. Only 38% of sessions provided the correct empirical antibiotic treatment, while antiviral treatment and dexamethasone were advised in 33% and 24%, respectively. Misleading statements were generated in 52%. No significant correlation was found between LLMs’ text length and performance (r=0.29, p=0.20). Among all LLMs, GTP-4 demonstrated the best performance. Discussion Latest LLMs provide valuable advice on differential diagnosis and diagnostic procedures but significantly vary in treatment-specific information for bacterial meningitis when introduced to a realistic clinical scenario. Misleading statements were common, with performance differences attributed to each LLM’s unique algorithm rather than output length. Conclusions Users must be aware of such limitations and performance variability when considering LLMs as a support tool for medical decision-making. Further research is needed to refine these models' comprehension of complex medical scenarios and their ability to provide reliable information. Data are available upon reasonable request.","PeriodicalId":9050,"journal":{"name":"BMJ Health & Care Informatics","volume":null,"pages":null},"PeriodicalIF":4.1000,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMJ Health & Care Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1136/bmjhci-2023-100978","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0

Abstract

Objectives We aimed to examine the adherence of large language models (LLMs) to bacterial meningitis guidelines using a hypothetical medical case, highlighting their utility and limitations in healthcare.

Methods A simulated clinical scenario of a patient with bacterial meningitis secondary to mastoiditis was presented in three independent sessions to seven publicly accessible LLMs (Bard, Bing, Claude-2, GPT-3.5, GPT-4, Llama, PaLM). Responses were evaluated for adherence to good clinical practice and two international meningitis guidelines.

Results A central nervous system infection was identified in 90% of LLM sessions. All recommended imaging, while 81% suggested lumbar puncture. Blood cultures and a specific mastoiditis work-up were proposed in only 62% and 38% of sessions, respectively. Only 38% of sessions provided the correct empirical antibiotic treatment, while antiviral treatment and dexamethasone were advised in 33% and 24%, respectively. Misleading statements were generated in 52% of sessions. No significant correlation was found between the LLMs' text length and performance (r=0.29, p=0.20). Among all LLMs, GPT-4 demonstrated the best performance.

Discussion The latest LLMs provide valuable advice on differential diagnosis and diagnostic procedures but vary significantly in treatment-specific information for bacterial meningitis when presented with a realistic clinical scenario. Misleading statements were common, with performance differences attributed to each LLM's unique algorithm rather than output length.

Conclusions Users must be aware of such limitations and performance variability when considering LLMs as a support tool for medical decision-making. Further research is needed to refine these models' comprehension of complex medical scenarios and their ability to provide reliable information. Data are available upon reasonable request.
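For readers who want to see the kind of length-versus-performance check reported in the Results (r=0.29, p=0.20), the following is a minimal sketch of a Pearson correlation between response length and an adherence score. The word counts and scores below are hypothetical placeholders, not the study's data, and the scoring variable is an assumed illustration of how guideline adherence might be quantified per session.

```python
# Minimal sketch: correlating LLM response length with a guideline-adherence
# score, analogous to the check described in the abstract.
# All values below are hypothetical placeholders, not the study's data.
from scipy.stats import pearsonr

# One entry per LLM session: word count of the response and the fraction of
# expected guideline items it covered (0.0 to 1.0).
response_word_counts = [220, 340, 410, 185, 500, 290, 360, 275, 430, 310]
adherence_scores = [0.55, 0.40, 0.65, 0.50, 0.45, 0.70, 0.48, 0.60, 0.52, 0.58]

# Pearson correlation coefficient and two-sided p-value.
r, p = pearsonr(response_word_counts, adherence_scores)
print(f"Pearson r = {r:.2f}, p = {p:.2f}")

# A non-significant p-value (as the study reports, p = 0.20) would indicate no
# evidence that longer answers are systematically better answers.
```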