Diagnostic performance of multimodal large language models in radiological quiz cases: the effects of prompt engineering and input conditions.

IF 2.4 | CAS Medicine Zone 3 | JCR Q2 | Radiology, Nuclear Medicine & Medical Imaging
Ultrasonography | Pub Date: 2025-05-01 | Epub Date: 2025-03-11 | DOI: 10.14366/usg.25012
Taewon Han, Woo Kyoung Jeong, Jaeseung Shin
Citations: 0

Abstract


Purpose: This study aimed to evaluate the diagnostic accuracy of three multimodal large language models (LLMs) in radiological image interpretation and to assess the impact of prompt engineering strategies and input conditions.

Methods: This study analyzed 67 radiological quiz cases from the Korean Society of Ultrasound in Medicine. Three multimodal LLMs (Claude 3.5 Sonnet, GPT-4o, and Gemini-1.5-Pro-002) were evaluated using six types of prompts (basic [without system prompt], original [specific instructions], chain-of-thought, reflection, multiagent, and artificial intelligence [AI]-generated). Performance was assessed across various factors, including tumor versus non-tumor status, case rarity, difficulty, and knowledge cutoff dates. A subgroup analysis compared diagnostic accuracy between imaging-only inputs and combined imaging-descriptive text inputs.
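The six prompt conditions can be organized as a simple configuration mapping each condition to a system prompt. The abstract does not give the actual prompt texts used in the study, so the strings below are hypothetical stand-ins sketching what each condition might look like; only the structure (a named condition with or without a system prompt) follows the Methods.

```python
# Hypothetical sketch of the six prompt conditions from the Methods.
# The real prompt texts are not published in this abstract; these strings
# are illustrative placeholders, not the study's actual prompts.
PROMPT_CONDITIONS = {
    "basic": None,  # evaluated without any system prompt
    "original": "You are a radiologist. State the single most likely diagnosis.",
    "chain_of_thought": (
        "Describe the imaging findings step by step, "
        "then state the most likely diagnosis."
    ),
    "reflection": (
        "Propose a diagnosis, critique your own reasoning, "
        "then give a final answer."
    ),
    "multiagent": (
        "Simulate three radiologists who each propose a diagnosis, "
        "then reach a consensus."
    ),
    "ai_generated": "<system prompt drafted by the model itself and reused here>",
}


def build_messages(condition: str, user_text: str) -> list[dict]:
    """Assemble a chat-style message list for one prompt condition."""
    messages = []
    system = PROMPT_CONDITIONS[condition]
    if system is not None:  # the 'basic' condition omits the system message
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_text})
    return messages
```

In a real evaluation harness the user message would also carry the case image as a multimodal content part, in whatever format the target model's API expects.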

Results: With imaging-only inputs, Claude 3.5 Sonnet achieved the highest overall accuracy (46.3%, 186/402), followed by GPT-4o (43.5%, 175/402) and Gemini-1.5-Pro-002 (39.8%, 160/402). AI-generated prompts yielded superior combined accuracy across all three models, with improvements over the basic (5.5%, P=0.035), chain-of-thought (4.0%, P=0.169), and multiagent prompts (3.5%, P=0.248). The integration of descriptive text significantly enhanced diagnostic accuracy for Claude 3.5 Sonnet (46.3% to 66.2%, P<0.001), GPT-4o (43.5% to 57.5%, P<0.001), and Gemini-1.5-Pro-002 (39.8% to 60.4%, P<0.001). Model performance was significantly influenced by case rarity for GPT-4o (rare: 6.7% vs. non-rare: 53.9%, P=0.001) and by knowledge cutoff dates for Claude 3.5 Sonnet (post-cutoff: 23.5% vs. pre-cutoff: 64.0%, P=0.005).
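The headline accuracies can be re-derived from the reported counts: each model answered 402 trials in total (67 cases × 6 prompt types), and accuracy is simply correct trials over 402.

```python
# Re-derive the imaging-only accuracies reported in the Results.
# Counts of correct answers per model, out of 67 cases x 6 prompt types.
CORRECT = {
    "Claude 3.5 Sonnet": 186,
    "GPT-4o": 175,
    "Gemini-1.5-Pro-002": 160,
}
TRIALS = 67 * 6  # 402 trials per model

accuracies = {model: round(100 * k / TRIALS, 1) for model, k in CORRECT.items()}
# accuracies == {"Claude 3.5 Sonnet": 46.3, "GPT-4o": 43.5, "Gemini-1.5-Pro-002": 39.8}
```

The per-case significance tests reported in the abstract (e.g. P=0.035 for AI-generated vs. basic prompts) require the paired per-case outcomes, which are not given here, so they cannot be reproduced from these totals alone.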

Conclusion: Claude 3.5 Sonnet achieved the highest diagnostic accuracy in radiological quiz cases, followed by GPT-4o and Gemini-1.5-Pro-002. The use of AI-generated prompts and the integration of descriptive text inputs enhanced model performance.

Source journal: Ultrasonography (Medicine – Radiology, Nuclear Medicine and Imaging)
- CiteScore: 5.10
- Self-citation rate: 6.50%
- Articles per year: 78
- Review time: 15 weeks
Journal description: Ultrasonography, the official English-language journal of the Korean Society of Ultrasound in Medicine (KSUM), is an international peer-reviewed academic journal dedicated to practice, research, technology, and education dealing with medical ultrasound. It was renamed from the Journal of the Korean Society of Ultrasound in Medicine in January 2014 and is published four times per year: January 1, April 1, July 1, and October 1. Original articles, technical notes, topical reviews, perspectives, pictorial essays, and timely editorial materials covering state-of-the-art content are published in Ultrasonography. The journal aims to provide updated information on new diagnostic concepts and technical developments, including experimental animal studies using new equipment, in addition to well-designed reviews of contemporary issues in patient care. Along with running KSUM Open, the annual international congress of KSUM, Ultrasonography also serves as a medium for cooperation among physicians and specialists from around the world who focus on ultrasound technologies, disease problems, and relevant basic science.