Multimodal Large Language Model Passes Specialty Board Examination and Surpasses Human Test-Taker Scores: A Comparative Analysis Examining the Stepwise Impact of Model Prompting Strategies on Performance

Jamil S. Samaan, S. MD, N. BS, A. BA, Y. MS, MSc Rajsavi Anand MD, F. S. MD, J. MS, S. MS, Ahmad Safavi-Naini, Bara El Kurdi MD, A. MD, MS Rabindra Watson MD, S. MD, M. J. G. MD, MPH Brennan M.R. Spiegel MD, N. P. T. MSHS
{"title":"Multimodal Large Language Model Passes Specialty Board Examination and Surpasses Human Test-Taker Scores: A Comparative Analysis Examining the Stepwise Impact of Model Prompting Strategies on Performance","authors":"Jamil S. Samaan, S. Md, N. Bs, A. Ba, Y. Ms, MSc Rajsavi Anand Md, F. S. Md, J. Ms, S. Ms, Ahmad Safavi-Naini, Bara El Kurdi Md, A. Md, MS Rabindra Watson Md, S. Md, M. J. G. Md, Mph Brennan M.R. Spiegel Md, N. P. T. Mshs","doi":"10.1101/2024.07.27.24310809","DOIUrl":null,"url":null,"abstract":"Background: Large language models (LLMs) have shown promise in answering medical licensing examination-style questions. However, there is limited research on the performance of multimodal LLMs on subspecialty medical examinations. Our study benchmarks the performance of multimodal LLMs enhanced by model prompting strategies on gastroenterology subspecialty examination-style questions and examines how these prompting strategies incrementally improve overall performance. Methods: We used the 2022 American College of Gastroenterology (ACG) self-assessment examination (N=300). This test is typically completed by gastroenterology fellows and established gastroenterologists preparing for the gastroenterology subspecialty board examination. We employed a sequential implementation of model prompting strategies: prompt engineering, Retrieval-Augmented Generation (RAG), five-shot learning, and an LLM-powered answer validation revision model (AVRM). GPT-4 and Gemini Pro were tested. Results: Implementing all prompting strategies improved the overall score of GPT-4 from 60.3% to 80.7% and Gemini Pro from 48.0% to 54.3%. GPT-4's score surpassed the 70% passing threshold and 75% average human test-taker scores unlike Gemini Pro. Stratification of questions by difficulty showed the accuracy of both LLMs mirrored that of human examinees, demonstrating higher accuracy as human test-taker accuracy increased. The addition of the AVRM to prompt, RAG, and 5-shot increased GPT-4's accuracy by 4.4%. The incremental addition of model prompting strategies improved accuracy for both non-image (57.2% to 80.4%) and image-based (63.0% to 80.9%) questions for GPT-4, but not Gemini Pro. Conclusions: Our results underscore the value of model prompting strategies in improving LLM performance on subspecialty-level licensing exam questions. We also present a novel implementation of an LLM-powered reviewer model in the context of subspecialty medicine which further improved model performance when combined with other prompting strategies. Our findings highlight the potential future role of multimodal LLMs, particularly with the implementation of multiple model prompting strategies, as clinical decision support systems in subspecialty care for healthcare providers. Keywords: ChatGPT, Gemini pro, gastroenterology, RAG, prompt engineering, medical specialty examination.","PeriodicalId":506788,"journal":{"name":"medRxiv","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.07.27.24310809","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Large language models (LLMs) have shown promise in answering medical licensing examination-style questions. However, there is limited research on the performance of multimodal LLMs on subspecialty medical examinations. Our study benchmarks the performance of multimodal LLMs enhanced by model prompting strategies on gastroenterology subspecialty examination-style questions and examines how these prompting strategies incrementally improve overall performance.

Methods: We used the 2022 American College of Gastroenterology (ACG) self-assessment examination (N=300). This test is typically completed by gastroenterology fellows and established gastroenterologists preparing for the gastroenterology subspecialty board examination. We implemented model prompting strategies sequentially: prompt engineering, retrieval-augmented generation (RAG), five-shot learning, and an LLM-powered answer validation revision model (AVRM). GPT-4 and Gemini Pro were tested.

Results: Implementing all prompting strategies improved the overall score of GPT-4 from 60.3% to 80.7% and of Gemini Pro from 48.0% to 54.3%. Unlike Gemini Pro, GPT-4 surpassed both the 70% passing threshold and the 75% average human test-taker score. When questions were stratified by difficulty, the accuracy of both LLMs mirrored that of human examinees, increasing as human test-taker accuracy increased. Adding the AVRM on top of prompt engineering, RAG, and five-shot learning increased GPT-4's accuracy by a further 4.4%. The incremental addition of prompting strategies improved GPT-4's accuracy on both non-image (57.2% to 80.4%) and image-based (63.0% to 80.9%) questions, but not Gemini Pro's.

Conclusions: Our results underscore the value of model prompting strategies in improving LLM performance on subspecialty-level licensing examination questions. We also present a novel implementation of an LLM-powered reviewer model in the context of subspecialty medicine, which further improved performance when combined with other prompting strategies. Our findings highlight the potential future role of multimodal LLMs, particularly with multiple model prompting strategies, as clinical decision support systems for healthcare providers in subspecialty care.

Keywords: ChatGPT, Gemini Pro, gastroenterology, RAG, prompt engineering, medical specialty examination.
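To make the sequential prompting stack described in the Methods concrete, the sketch below shows one way such a pipeline can be wired together: an engineered system prompt, retrieved reference passages (RAG), five worked examples, and a second LLM pass acting as an AVRM-style reviewer that confirms or revises the draft answer. This is a minimal illustration under stated assumptions, not the authors' code; the function and variable names (ask, SYSTEM_PROMPT, build_prompt, answer_with_validation) and the generic `ask(prompt) -> str` chat interface are hypothetical placeholders for calls to GPT-4 or Gemini Pro.

```python
# Minimal sketch of a sequential prompting stack: prompt engineering, RAG context,
# five-shot examples, and an answer-validation/revision (AVRM-style) second pass.
# All names here are illustrative assumptions, not the paper's implementation.
from typing import Callable, List

AskFn = Callable[[str], str]  # wraps one call to a chat model (e.g. GPT-4 or Gemini Pro)

SYSTEM_PROMPT = (
    "You are a board-certified gastroenterologist answering multiple-choice "
    "board-examination questions. Reply with the single best answer letter."
)

def build_prompt(question: str,
                 retrieved_passages: List[str],
                 few_shot_examples: List[str]) -> str:
    """Assemble one prompt from the engineered instructions, RAG context,
    and worked examples, followed by the target question."""
    parts = [SYSTEM_PROMPT]
    if retrieved_passages:
        parts.append("Reference material:\n" + "\n---\n".join(retrieved_passages))
    if few_shot_examples:
        parts.append("Worked examples:\n" + "\n\n".join(few_shot_examples))
    parts.append("Question:\n" + question)
    return "\n\n".join(parts)

def answer_with_validation(ask: AskFn,
                           question: str,
                           retrieved_passages: List[str],
                           few_shot_examples: List[str]) -> str:
    """Draft an answer, then have a second LLM pass (the reviewer/AVRM step)
    confirm or revise it before the final answer is returned."""
    draft = ask(build_prompt(question, retrieved_passages, few_shot_examples))
    review_prompt = (
        f"{SYSTEM_PROMPT}\n\nQuestion:\n{question}\n\n"
        f"Proposed answer: {draft}\n\n"
        "Critically check the proposed answer against the question. "
        "If it is correct, repeat it; otherwise reply with the corrected answer letter."
    )
    return ask(review_prompt)
```

In this arrangement each strategy is additive, mirroring the paper's stepwise evaluation: dropping the reviewer call reduces the pipeline to prompt + RAG + five-shot, and emptying the passage and example lists reduces it further to prompt engineering alone.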