Comparison of Multiple State-of-the-Art Large Language Models for Patient Education Prior to CT and MRI Examinations.

Impact Factor 3.0 · CAS Tier 3 (Medicine) · JCR Q2 · Health Care Sciences & Services
Semil Eminovic, Bogdan Levita, Andrea Dell'Orco, Jonas Alexander Leppig, Jawed Nawabi, Tobias Penzkofer
Cited by: 0

Abstract

Background/Objectives: This study compares the accuracy of responses from state-of-the-art large language models (LLMs) to patient questions asked before CT and MRI examinations. We aim to demonstrate the potential of LLMs to improve workflow efficiency while also highlighting risks such as misinformation. Methods: A total of 57 CT-related and 64 MRI-related patient questions were presented to ChatGPT-4o, Claude 3.5 Sonnet, Google Gemini, and Mistral Large 2. Each answer was evaluated by two board-certified radiologists and scored for accuracy, correctness, and likelihood to mislead on a 5-point Likert scale. Statistical tests compared LLM performance across question categories. Results: ChatGPT-4o achieved the highest average scores for CT-related questions and tied with Claude 3.5 Sonnet for MRI-related questions; all models scored higher for MRI than for CT (ChatGPT-4o: CT 4.52 (± 0.46), MRI 4.79 (± 0.37); Google Gemini: CT 4.44 (± 0.58), MRI 4.68 (± 0.58); Claude 3.5 Sonnet: CT 4.40 (± 0.59), MRI 4.79 (± 0.37); Mistral Large 2: CT 4.25 (± 0.54), MRI 4.74 (± 0.47)). At least one response per LLM was rated as inaccurate, and Google Gemini produced potentially misleading answers most often (5.26% of CT and 2.34% of MRI responses). Mistral Large 2 was outperformed by ChatGPT-4o on all CT-related questions (p < 0.001), and by ChatGPT-4o (p = 0.003), Google Gemini (p = 0.022), and Claude 3.5 Sonnet (p = 0.004) on CT contrast media information questions. Conclusions: Although all LLMs performed well overall and showed great potential for patient education, each model occasionally produced potentially misleading information, highlighting the risk of clinical application.
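The Results section reports each model's performance as a mean Likert score with standard deviation, e.g. "4.52 (± 0.46)". As a minimal illustration of that aggregation step, the sketch below computes "mean (± SD)" summaries from per-question scores. The scores, model names as dictionary keys, and the assumption that the two radiologists' ratings are averaged per question before aggregation are all illustrative assumptions, not the study's actual data or protocol.

```python
# Hedged sketch: aggregating 5-point Likert ratings into the
# "mean (± SD)" summaries shown in the Results section.
# All scores below are synthetic; the real study used 57 CT and
# 64 MRI questions rated by two board-certified radiologists.
from statistics import mean, stdev

# ratings[model][modality] -> per-question scores (assumed here to be
# each question's average across the two raters)
ratings = {
    "ChatGPT-4o": {"CT": [5, 4.5, 4, 5, 4.5], "MRI": [5, 5, 4.5, 5]},
    "Mistral Large 2": {"CT": [4, 4.5, 4, 4, 4.5], "MRI": [5, 4.5, 5, 4.5]},
}

def summarize(scores):
    """Format a list of Likert scores as 'mean (± SD)'."""
    return f"{mean(scores):.2f} (± {stdev(scores):.2f})"

for model, by_modality in ratings.items():
    for modality, scores in by_modality.items():
        print(f"{model} {modality}: {summarize(scores)}")
```

Note that `stdev` computes the sample standard deviation; whether the study used sample or population SD is not stated in the abstract.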

Source journal: Journal of Personalized Medicine (Medicine, miscellaneous)
CiteScore: 4.10
Self-citation rate: 0.00%
Articles per year: 1878
Review time: 11 weeks
About the journal: Journal of Personalized Medicine (JPM; ISSN 2075-4426) is an international, open access journal aimed at bringing all aspects of personalized medicine to one platform. JPM publishes cutting edge, innovative preclinical and translational scientific research and technologies related to personalized medicine (e.g., pharmacogenomics/proteomics, systems biology). JPM recognizes that personalized medicine—the assessment of genetic, environmental and host factors that cause variability of individuals—is a challenging, transdisciplinary topic that requires discussions from a range of experts. For a comprehensive perspective of personalized medicine, JPM aims to integrate expertise from the molecular and translational sciences, therapeutics and diagnostics, as well as discussions of regulatory, social, ethical and policy aspects. We provide a forum to bring together academic and clinical researchers, biotechnology, diagnostic and pharmaceutical companies, health professionals, regulatory and ethical experts, and government and regulatory authorities.