Luke Miller, Peter Kamel, Jigar Patel, Jay Agrawal, Min Zhan, Nathan Bumbarger, Kenneth Wang
Journal of Imaging Informatics in Medicine, published 2024-11-07. DOI: 10.1007/s10278-024-01161-3
A Comparative Evaluation of Large Language Model Utility in Neuroimaging Clinical Decision Support.
Imaging utilization has increased dramatically in recent years, and at least some of these studies are not appropriate for the clinical scenario. The development of large language models (LLMs) may address this issue by providing a more accessible reference resource for ordering providers, but their relative performance is currently understudied. This study evaluates and compares the relative appropriateness and usefulness of imaging recommendations generated by eight publicly available models in response to neuroradiology clinical scenarios. Twenty-four common neuroradiology clinical scenarios that often yield suboptimal imaging utilization were selected. Questions were crafted to assess the ability of LLMs to provide accurate and actionable advice. The LLMs were assessed in August 2023 using natural-language, one- to two-sentence queries requesting advice about optimal image ordering given certain clinical parameters. Eight of the most well-known LLMs were chosen for evaluation: ChatGPT, GPT-4, Bard (versions 1 and 2), Bing Chat, Llama 2, Perplexity, and Claude. Three fellowship-trained neuroradiologists graded each model's advice as "optimal" or "not optimal" according to the ACR Appropriateness Criteria or the New Orleans Head CT Criteria. The raters also ranked the models on the appropriateness, helpfulness, concision, and source citations in their responses. The models varied in their ability to deliver an "optimal" recommendation in these scenarios: ChatGPT (20/24), GPT-4 (23/24), Bard 1 (13/24), Bard 2 (14/24), Bing Chat (14/24), Llama 2 (5/24), Perplexity (19/24), and Claude (19/24). The median ranks of the LLMs were: ChatGPT (3), GPT-4 (1.5), Bard 1 (4.5), Bard 2 (5), Bing Chat (6), Llama 2 (7.5), Perplexity (4), and Claude (3). Characteristic errors are described and discussed. GPT-4, ChatGPT, and Claude generally outperformed Bard, Bing Chat, and Llama 2.
This study evaluates the performance of a greater variety of publicly available LLMs, in settings that more closely mimic real-world use cases, and discusses the practical challenges of doing so. This is the first study to evaluate and compare a wide range of publicly available LLMs to determine the appropriateness of their neuroradiology imaging recommendations.
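For quick side-by-side comparison, the per-model results reported in the abstract can be tabulated programmatically. This is an illustrative sketch only: the figures are copied directly from the abstract, and the dictionary layout and ranking criterion (fraction of "optimal" recommendations) are choices made here for presentation, not part of the study's methodology.

```python
# Per-model results as reported in the abstract:
#   "optimal"     = number of "optimal" recommendations out of 24 scenarios
#   "median_rank" = median rank assigned by the three neuroradiologist raters
results = {
    "ChatGPT":    {"optimal": 20, "median_rank": 3.0},
    "GPT-4":      {"optimal": 23, "median_rank": 1.5},
    "Bard 1":     {"optimal": 13, "median_rank": 4.5},
    "Bard 2":     {"optimal": 14, "median_rank": 5.0},
    "Bing Chat":  {"optimal": 14, "median_rank": 6.0},
    "Llama 2":    {"optimal": 5,  "median_rank": 7.5},
    "Perplexity": {"optimal": 19, "median_rank": 4.0},
    "Claude":     {"optimal": 19, "median_rank": 3.0},
}

# Sort models by fraction of optimal recommendations, best first.
ranking = sorted(results, key=lambda m: results[m]["optimal"], reverse=True)
for model in ranking:
    r = results[model]
    print(f"{model:<10} {r['optimal']}/24 optimal, median rank {r['median_rank']}")
```

Consistent with the abstract's conclusion, this ordering places GPT-4 first and Llama 2 last, with ChatGPT, Perplexity, and Claude in the upper half.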