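Diagnostic performance of multimodal large language models in radiological quiz cases: the effects of prompt engineering and input conditions.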

IF 2.4; Zone 3, Medicine; JCR Q2: RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING

Ultrasonography; Pub Date: 2025-05-01; Epub Date: 2025-03-11; DOI: 10.14366/usg.25012

Taewon Han, Woo Kyoung Jeong, Jaeseung Shin

Ultrasonography 44(3):220-231. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12081132/pdf/

Citations: 0

Abstract



Purpose: This study aimed to evaluate the diagnostic accuracy of three multimodal large language models (LLMs) in radiological image interpretation and to assess the impact of prompt engineering strategies and input conditions.

Methods: This study analyzed 67 radiological quiz cases from the Korean Society of Ultrasound in Medicine. Three multimodal LLMs (Claude 3.5 Sonnet, GPT-4o, and Gemini-1.5-Pro-002) were evaluated using six types of prompts (basic [without system prompt], original [specific instructions], chain-of-thought, reflection, multiagent, and artificial intelligence [AI]-generated). Performance was assessed across various factors, including tumor versus non-tumor status, case rarity, difficulty, and knowledge cutoff dates. A subgroup analysis compared diagnostic accuracy between imaging-only inputs and combined imaging-descriptive text inputs.
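The design described in Methods fixes the number of responses per model: 67 cases under each of six prompts gives 402, which matches the denominators reported in Results. The sketch below enumerates that grid. It assumes exactly one response per model, prompt, and case combination, which the abstract does not state explicitly; the variable names are illustrative only.

```python
from itertools import product

# Models and prompt types as named in the abstract.
MODELS = ["Claude 3.5 Sonnet", "GPT-4o", "Gemini-1.5-Pro-002"]
PROMPTS = [
    "basic",
    "original",
    "chain-of-thought",
    "reflection",
    "multiagent",
    "AI-generated",
]
N_CASES = 67

# Assumption: one response per (model, prompt, case) combination.
trials = list(product(MODELS, PROMPTS, range(N_CASES)))

responses_per_model = len(PROMPTS) * N_CASES
responses_per_prompt = len(MODELS) * N_CASES

print(responses_per_model)   # 402
print(responses_per_prompt)  # 201
print(len(trials))           # 1206
```

Under this assumption, a per-prompt comparison pooled across the three models rests on 201 responses per prompt type.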

Results: With imaging-only inputs, Claude 3.5 Sonnet achieved the highest overall accuracy (46.3%, 186/402), followed by GPT-4o (43.5%, 175/402) and Gemini-1.5-Pro-002 (39.8%, 160/402). AI-generated prompts yielded superior combined accuracy across all three models, with improvements over the basic (5.5%, P=0.035), chain-of-thought (4.0%, P=0.169), and multiagent prompts (3.5%, P=0.248). The integration of descriptive text significantly enhanced diagnostic accuracy for Claude 3.5 Sonnet (46.3% to 66.2%, P<0.001), GPT-4o (43.5% to 57.5%, P<0.001), and Gemini-1.5-Pro-002 (39.8% to 60.4%, P<0.001). Model performance was significantly influenced by case rarity for GPT-4o (rare: 6.7% vs. non-rare: 53.9%, P=0.001) and by knowledge cutoff dates for Claude 3.5 Sonnet (post-cutoff: 23.5% vs. pre-cutoff: 64.0%, P=0.005).
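The reported percentages can be checked directly against the reported counts, and the effect of adding descriptive text can be expressed as a difference in percentage points. The following is a minimal sketch using only figures quoted in the abstract; the helper name `accuracy_pct` is hypothetical, and the P values are not recomputed because the abstract does not give the paired data or name the statistical test.

```python
# Correct/total counts for imaging-only inputs, as reported in the abstract.
imaging_only = {
    "Claude 3.5 Sonnet": (186, 402),
    "GPT-4o": (175, 402),
    "Gemini-1.5-Pro-002": (160, 402),
}

# Accuracy (%) with imaging plus descriptive text, as reported in the abstract.
with_text_pct = {
    "Claude 3.5 Sonnet": 66.2,
    "GPT-4o": 57.5,
    "Gemini-1.5-Pro-002": 60.4,
}


def accuracy_pct(correct: int, total: int) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)


baseline = {m: accuracy_pct(c, n) for m, (c, n) in imaging_only.items()}
gain_points = {m: round(with_text_pct[m] - baseline[m], 1) for m in baseline}

for model in baseline:
    print(f"{model}: {baseline[model]}% -> {with_text_pct[model]}% "
          f"(+{gain_points[model]} percentage points)")
```

The recomputed baselines agree with the stated 46.3%, 43.5%, and 39.8%, and the gains come out to 19.9, 14.0, and 20.6 percentage points respectively.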

Conclusion: Claude 3.5 Sonnet achieved the highest diagnostic accuracy in radiological quiz cases, followed by GPT-4o and Gemini-1.5-Pro-002. The use of AI-generated prompts and the integration of descriptive text inputs enhanced model performance.


Source journal

Ultrasonography (Medicine: Radiology, Nuclear Medicine and Imaging)

CiteScore: 5.10

Self-citation rate: 6.50%

Articles published:

Review time: 15 weeks

About the journal: Ultrasonography, the official English-language journal of the Korean Society of Ultrasound in Medicine (KSUM), is an international peer-reviewed academic journal dedicated to practice, research, technology, and education dealing with medical ultrasound. It was renamed from the Journal of Korean Society of Ultrasound in Medicine in January 2014 and is published four times per year: January 1, April 1, July 1, and October 1. Original articles, technical notes, topical reviews, perspectives, pictorial essays, and timely editorial materials covering state-of-the-art content are published in Ultrasonography. The journal aims to provide updated information on new diagnostic concepts and technical developments, including experimental animal studies using new equipment, in addition to well-designed reviews of contemporary issues in patient care. Along with running KSUM Open, the annual international congress of KSUM, Ultrasonography also serves as a medium for cooperation among physicians and specialists from around the world who focus on ultrasound technology, disease problems, and relevant basic science.