Evaluating the Diagnostic and Treatment Recommendation Capabilities of GPT-4 Vision in Dermatology

Abhinav Pillai, Sharon Parappally-Joseph, Jori Hardin
medRxiv - Dermatology, published 2024-01-26
DOI: 10.1101/2024.01.24.24301743 (https://doi.org/10.1101/2024.01.24.24301743)
Citations: 0

Abstract

Background: The integration of artificial intelligence (AI) in dermatology presents a promising frontier for enhancing diagnostic accuracy and treatment planning. However, general-purpose AI models require rigorous evaluation before being applied to real-world medical cases.

Objective: This project evaluates GPT-4V's performance in diagnosing common dermatological conditions and generating treatment plans for them, comparing its assessment of textual versus image data and its performance with multimodal inputs. Beyond the immediate scope, this study contributes to the broader trajectory of integrating AI in healthcare, highlighting the limitations of these technologies as well as their potential to enhance efficiency and education within medical training and practice.

Methods: A dataset of 102 images representing nine common dermatological conditions was compiled from open-access websites. Fifty-four images were ultimately selected by two board-certified dermatologists as being representative and typical of the common conditions. Additionally, nine clinical scenarios corresponding to these conditions were developed. GPT-4V's diagnostic capabilities were assessed in three setups: Image Prompt (image-based), Scenario Prompt (text-based), and Image and Scenario Prompt (combining both modalities). The model's performance was evaluated based on diagnostic accuracy, differential diagnosis, and treatment recommendations.

Results: In the Image Prompt setup, GPT-4V correctly identified the primary diagnosis for 29 of 54 images (54%). The Scenario Prompt setup showed a higher accuracy rate of 89% in identifying the primary diagnosis. The multimodal Image and Scenario Prompt setup also achieved an 89% accuracy rate. However, a notable bias towards textual data over visual data was observed. Treatment recommendations were evaluated by the same two dermatologists using a modified Entrustment Scale and showed competent but not expert-level performance.

Conclusion: GPT-4V demonstrates promising capabilities in dermatological diagnosis and treatment recommendations, particularly in text-based scenarios. However, its performance in image-based diagnosis and integration of multimodal data highlights areas for improvement. The study underscores the potential of AI in augmenting dermatological practice, emphasizing the need for further development and fine-tuning of such models to ensure their efficacy and reliability in clinical settings.

Keywords: Artificial Intelligence; Dermatology; GPT-4V; Diagnostic Accuracy; Treatment Planning; Multimodal AI; Large Language Model.
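
For readers who want to reproduce the arithmetic behind the reported Image Prompt result, the following is a minimal sketch of how a primary-diagnosis accuracy rate can be computed from raw counts. The function name `accuracy` is hypothetical and is not taken from the study. Only the 29-of-54 count is given in the abstract; the counts behind the two 89% figures are not stated there, so they are not reconstructed here.

```python
def accuracy(correct: int, total: int) -> float:
    """Return the fraction of cases with a correct primary diagnosis.

    Hypothetical helper for illustration; not part of the study's code.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return correct / total


# Image Prompt setup, counts taken from the abstract: 29 of 54 images.
image_prompt_accuracy = accuracy(29, 54)
print(f"Image Prompt accuracy: {image_prompt_accuracy:.1%}")  # prints "Image Prompt accuracy: 53.7%"
```

Set beside the 89% reported for the two setups that include the clinical scenario text, this figure shows the gap between image-only and text-supported diagnosis that the abstract describes.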