Semil Eminovic, Bogdan Levita, Andrea Dell'Orco, Jonas Alexander Leppig, Jawed Nawabi, Tobias Penzkofer
{"title":"CT和MRI检查前患者教育的多个最先进的大型语言模型的比较。","authors":"Semil Eminovic, Bogdan Levita, Andrea Dell'Orco, Jonas Alexander Leppig, Jawed Nawabi, Tobias Penzkofer","doi":"10.3390/jpm15060235","DOIUrl":null,"url":null,"abstract":"<p><p><b>Background/Objectives</b>: This study compares the accuracy of responses from state-of-the-art large language models (LLMs) to patient questions before CT and MRI imaging. We aim to demonstrate the potential of LLMs in improving workflow efficiency, while also highlighting risks such as misinformation. <b>Methods</b>: There were 57 CT-related and 64 MRI-related patient questions displayed to ChatGPT-4o, Claude 3.5 Sonnet, Google Gemini, and Mistral Large 2. Each answer was evaluated by two board-certified radiologists and scored for accuracy/correctness/likelihood to mislead using a 5-point Likert scale. Statistics compared LLM performance across question categories. <b>Results</b>: ChatGPT-4o achieved the highest average scores for CT-related questions and tied with Claude 3.5 Sonnet for MRI-related questions, with higher scores across all models for MRI (ChatGPT-4o: CT [4.52 (± 0.46)], MRI: [4.79 (± 0.37)]; Google Gemini: CT [4.44 (± 0.58)]; MRI [4.68 (± 0.58)]; Claude 3.5 Sonnet: CT [4.40 (± 0.59)]; MRI [4.79 (± 0.37)]; Mistral Large 2: CT [4.25 (± 0.54)]; MRI [4.74 (± 0.47)]). At least one response per LLM was rated as inaccurate, with Google Gemini answering most often potentially misleading (in 5.26% for CT and 2.34% for MRI). Mistral Large 2 was outperformed by ChatGPT-4o for all CT-related questions (<i>p</i> < 0.001) and by ChatGPT-4o (<i>p</i> = 0.003), Google Gemini (<i>p</i> = 0.022), and Claude 3.5 Sonnet (<i>p</i> = 0.004) for all CT Contrast media information questions. <b>Conclusions</b>: Even though all LLMs performed well overall and showed great potential for patient education, each model occasionally displayed potentially misleading information, highlighting the clinical application risk.</p>","PeriodicalId":16722,"journal":{"name":"Journal of Personalized Medicine","volume":"15 6","pages":""},"PeriodicalIF":3.0000,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12194482/pdf/","citationCount":"0","resultStr":"{\"title\":\"Comparison of Multiple State-of-the-Art Large Language Models for Patient Education Prior to CT and MRI Examinations.\",\"authors\":\"Semil Eminovic, Bogdan Levita, Andrea Dell'Orco, Jonas Alexander Leppig, Jawed Nawabi, Tobias Penzkofer\",\"doi\":\"10.3390/jpm15060235\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><b>Background/Objectives</b>: This study compares the accuracy of responses from state-of-the-art large language models (LLMs) to patient questions before CT and MRI imaging. We aim to demonstrate the potential of LLMs in improving workflow efficiency, while also highlighting risks such as misinformation. <b>Methods</b>: There were 57 CT-related and 64 MRI-related patient questions displayed to ChatGPT-4o, Claude 3.5 Sonnet, Google Gemini, and Mistral Large 2. Each answer was evaluated by two board-certified radiologists and scored for accuracy/correctness/likelihood to mislead using a 5-point Likert scale. Statistics compared LLM performance across question categories. 
<b>Results</b>: ChatGPT-4o achieved the highest average scores for CT-related questions and tied with Claude 3.5 Sonnet for MRI-related questions, with higher scores across all models for MRI (ChatGPT-4o: CT [4.52 (± 0.46)], MRI: [4.79 (± 0.37)]; Google Gemini: CT [4.44 (± 0.58)]; MRI [4.68 (± 0.58)]; Claude 3.5 Sonnet: CT [4.40 (± 0.59)]; MRI [4.79 (± 0.37)]; Mistral Large 2: CT [4.25 (± 0.54)]; MRI [4.74 (± 0.47)]). At least one response per LLM was rated as inaccurate, with Google Gemini answering most often potentially misleading (in 5.26% for CT and 2.34% for MRI). Mistral Large 2 was outperformed by ChatGPT-4o for all CT-related questions (<i>p</i> < 0.001) and by ChatGPT-4o (<i>p</i> = 0.003), Google Gemini (<i>p</i> = 0.022), and Claude 3.5 Sonnet (<i>p</i> = 0.004) for all CT Contrast media information questions. <b>Conclusions</b>: Even though all LLMs performed well overall and showed great potential for patient education, each model occasionally displayed potentially misleading information, highlighting the clinical application risk.</p>\",\"PeriodicalId\":16722,\"journal\":{\"name\":\"Journal of Personalized Medicine\",\"volume\":\"15 6\",\"pages\":\"\"},\"PeriodicalIF\":3.0000,\"publicationDate\":\"2025-06-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12194482/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Personalized Medicine\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.3390/jpm15060235\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Personalized Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.3390/jpm15060235","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Comparison of Multiple State-of-the-Art Large Language Models for Patient Education Prior to CT and MRI Examinations.
Background/Objectives: This study compares the accuracy of responses from state-of-the-art large language models (LLMs) to patient questions asked before CT and MRI imaging. We aim to demonstrate the potential of LLMs to improve workflow efficiency while also highlighting risks such as misinformation. Methods: A total of 57 CT-related and 64 MRI-related patient questions were presented to ChatGPT-4o, Claude 3.5 Sonnet, Google Gemini, and Mistral Large 2. Each answer was evaluated by two board-certified radiologists and scored for accuracy/correctness/likelihood to mislead on a 5-point Likert scale. Statistical tests compared LLM performance across question categories. Results: ChatGPT-4o achieved the highest average scores for CT-related questions and tied with Claude 3.5 Sonnet for MRI-related questions; all models scored higher for MRI (ChatGPT-4o: CT 4.52 (± 0.46), MRI 4.79 (± 0.37); Google Gemini: CT 4.44 (± 0.58), MRI 4.68 (± 0.58); Claude 3.5 Sonnet: CT 4.40 (± 0.59), MRI 4.79 (± 0.37); Mistral Large 2: CT 4.25 (± 0.54), MRI 4.74 (± 0.47)). At least one response per LLM was rated as inaccurate, and Google Gemini most frequently gave potentially misleading answers (5.26% of CT and 2.34% of MRI responses). Mistral Large 2 was outperformed by ChatGPT-4o on all CT-related questions (p < 0.001) and by ChatGPT-4o (p = 0.003), Google Gemini (p = 0.022), and Claude 3.5 Sonnet (p = 0.004) on all CT contrast media information questions. Conclusions: Although all LLMs performed well overall and showed great potential for patient education, each model occasionally produced potentially misleading information, highlighting the risk of clinical application.
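The abstract does not specify which statistical tests were used. As a minimal sketch of the kind of analysis the Methods describe (per-question Likert ratings summarized as mean ± SD, then pairwise model comparisons), the snippet below illustrates one plausible approach; the choice of a Wilcoxon signed-rank test and all numbers generated here are assumptions for illustration, not the study's actual data or method.

```python
# Hedged sketch: summarizing 5-point Likert ratings per model and comparing
# two models pairwise. All scores below are simulated to match the abstract's
# reported means/SDs; they are NOT the study's data, and the test choice is
# an assumption (the paper's abstract does not name its statistical tests).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-question mean ratings (1-5) for two models on the
# 57 CT-related questions; real values would come from the two radiologists.
chatgpt_ct = np.clip(rng.normal(4.52, 0.46, 57), 1, 5)
mistral_ct = np.clip(rng.normal(4.25, 0.54, 57), 1, 5)

# Mean (± SD) per model, matching the abstract's reporting style.
for name, scores in [("ChatGPT-4o", chatgpt_ct), ("Mistral Large 2", mistral_ct)]:
    print(f"{name}: {scores.mean():.2f} (± {scores.std(ddof=1):.2f})")

# Pairwise comparison: a nonparametric test suits ordinal Likert data, and
# since both models answered the same questions, a paired (signed-rank) test
# is one reasonable choice.
stat, p = stats.wilcoxon(chatgpt_ct, mistral_ct)
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p:.4f}")
```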
Journal description:
Journal of Personalized Medicine (JPM; ISSN 2075-4426) is an international, open access journal aimed at bringing all aspects of personalized medicine to one platform. JPM publishes cutting-edge, innovative preclinical and translational scientific research and technologies related to personalized medicine (e.g., pharmacogenomics/proteomics, systems biology). JPM recognizes that personalized medicine (the assessment of genetic, environmental, and host factors that cause variability among individuals) is a challenging, transdisciplinary topic that requires discussion among a range of experts. For a comprehensive perspective on personalized medicine, JPM aims to integrate expertise from the molecular and translational sciences, therapeutics, and diagnostics, as well as discussions of regulatory, social, ethical, and policy aspects. It provides a forum that brings together academic and clinical researchers; biotechnology, diagnostic, and pharmaceutical companies; health professionals; regulatory and ethical experts; and government and regulatory authorities.