Hidden Flaws Behind Expert-Level Accuracy of Multimodal GPT-4 Vision in Medicine.

ArXiv Pub Date : 2024-08-31
Qiao Jin, Fangyuan Chen, Yiliang Zhou, Ziyang Xu, Justin M Cheung, Robert Chen, Ronald M Summers, Justin F Rousseau, Peiyun Ni, Marc J Landsman, Sally L Baxter, Subhi J Al'Aref, Yijia Li, Alexander Chen, Josef A Brejt, Michael F Chiang, Yifan Peng, Zhiyong Lu
Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10896362/pdf/

Abstract

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations have focused primarily on the accuracy of multiple-choice answers alone. Our study extends the current scope with a comprehensive analysis of GPT-4V's rationales for image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges - an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians in multiple-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well on cases that physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales even in cases where it makes the correct final choice (35.5%), most prominently in image comprehension (27.2%). Despite GPT-4V's high accuracy on multiple-choice questions, our findings emphasize the necessity of further in-depth evaluation of its rationales before integrating such multimodal AI models into clinical workflows.
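The headline metrics above (overall multiple-choice accuracy, and conditional accuracy on cases physicians got wrong) can be illustrated with a minimal sketch. The data and variable names below are purely hypothetical, not taken from the paper:

```python
# Illustrative sketch of the two accuracy metrics described in the abstract.
# All data here is toy data, not from the NEJM Image Challenge evaluation.

def accuracy(predictions, answers):
    """Fraction of questions answered correctly."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

# Toy multiple-choice data: gold answers, model picks, physician picks
answers    = ["A", "B", "C", "D", "A"]
gpt4v      = ["A", "B", "C", "A", "A"]
physicians = ["A", "B", "D", "D", "B"]

# Overall multiple-choice accuracy (paper reports 81.6% vs. 77.8%)
overall = accuracy(gpt4v, answers)

# Conditional accuracy: model performance restricted to the cases
# physicians answered incorrectly (paper reports over 78%)
hard = [(g, a) for g, a, p in zip(gpt4v, answers, physicians) if p != a]
conditional = sum(g == a for g, a in hard) / len(hard)

print(overall, conditional)
```

Note that the conditional metric is computed on a different, smaller subset, which is why the paper reports it separately from overall accuracy.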
