Evaluation of a digital ophthalmologist app built by GPT4-V(ision)

Pusheng Xu, Xiaolan Chen, Ziwei Zhao, Yingfeng Zheng, Guangming Jin, Danli Shi, Mingguang He
{"title":"Evaluation of a digital ophthalmologist app built by GPT4-V(ision)","authors":"Pusheng Xu, Xiaolan Chen, Ziwei Zhao, Yingfeng Zheng, Guangming Jin, Danli Shi, Mingguang He","doi":"10.1101/2023.11.27.23299056","DOIUrl":null,"url":null,"abstract":"Backgrounds: GPT4-V(ision) has generated great interest across various fields, while its performance in ocular multimodal images is still unknown. This study aims to evaluate the capabilities of a GPT-4V-based chatbot in addressing queries related to ocular multimodal images.\nMethods: A digital ophthalmologist app was built based on GPT-4V. The evaluation dataset comprised various ocular imaging modalities: slit-lamp, scanning laser ophthalmoscopy (SLO), fundus photography of the posterior pole (FPP), optical coherence tomography (OCT), fundus fluorescein angiography (FFA), and ocular ultrasound (OUS). Each modality included images representing 5 common and 5 rare diseases. The chatbot was presented with ten questions per image, focusing on examination identification, lesion detection, diagnosis, decision support, and the repeatability of diagnosis. The responses of GPT-4V were evaluated based on accuracy, usability, and safety.\nResults: There was a substantial agreement among three ophthalmologists. Out of 600 responses,30.5% were accurate,22.8% of 540 responses were highly usable, and 55.5% of 540 responses were considered safe by ophthalmologists. The chatbot excelled in interpreting slit-lamp images, with 42.0%,42.2%, and 68.5% of the responses being accurate, highly usable, and no harm, respectively. However, its performance was notably weaker in FPP images, with only 13.7%,3.7%, and 38.5% in the same categories. It correctly identified 95.6% of the imaging modalities. For lesion identification, diagnosis, and decision support, the chatbot's accuracy was 25.6%,16.1%, and 24.0%, respectively. The average proportions of correct answers, highly usable, and no harm for GPT-4V in common diseases were 37.9%,30.5%, and 60.1%, respectively. These proportions were all higher compared to those in rare diseases, which were 23.2% (P<0.001),15.2% (P<0.001), and 51.1% (P=0.032), respectively. The overall repeatability of GPT4-V in diagnosing ocular images was 63% (38/60). Conclusion: Currently, GPT-4V lacks the reliability required for clinical decision-making and patient consultation in ophthalmology. Ongoing refinement and testing are essential for improving the efficacy of large language models in medical applications.","PeriodicalId":501390,"journal":{"name":"medRxiv - Ophthalmology","volume":"11 3-4","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv - Ophthalmology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2023.11.27.23299056","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Background: GPT-4V(ision) has generated great interest across various fields, but its performance on ocular multimodal images remains unknown. This study aims to evaluate the capabilities of a GPT-4V-based chatbot in addressing queries related to ocular multimodal images.

Methods: A digital ophthalmologist app was built based on GPT-4V. The evaluation dataset comprised various ocular imaging modalities: slit-lamp, scanning laser ophthalmoscopy (SLO), fundus photography of the posterior pole (FPP), optical coherence tomography (OCT), fundus fluorescein angiography (FFA), and ocular ultrasound (OUS). Each modality included images representing 5 common and 5 rare diseases. The chatbot was presented with ten questions per image, focusing on examination identification, lesion detection, diagnosis, decision support, and repeatability of diagnosis. The responses of GPT-4V were evaluated for accuracy, usability, and safety.

Results: There was substantial agreement among the three ophthalmologists. Of 600 responses, 30.5% were accurate; 22.8% of 540 responses were highly usable; and 55.5% of 540 responses were considered safe by the ophthalmologists. The chatbot performed best at interpreting slit-lamp images, with 42.0%, 42.2%, and 68.5% of responses rated accurate, highly usable, and no harm, respectively. Its performance was notably weaker on FPP images, with only 13.7%, 3.7%, and 38.5% in the same categories. It correctly identified 95.6% of the imaging modalities. For lesion identification, diagnosis, and decision support, the chatbot's accuracy was 25.6%, 16.1%, and 24.0%, respectively. The average proportions of correct, highly usable, and no-harm responses for common diseases were 37.9%, 30.5%, and 60.1%, respectively; all were higher than the corresponding proportions for rare diseases, which were 23.2% (P<0.001), 15.2% (P<0.001), and 51.1% (P=0.032). The overall repeatability of GPT-4V in diagnosing ocular images was 63% (38/60).

Conclusion: Currently, GPT-4V lacks the reliability required for clinical decision-making and patient consultation in ophthalmology. Ongoing refinement and testing are essential for improving the efficacy of large language models in medical applications.
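The abstract does not include implementation details for the app, but the sketch below illustrates how a GPT-4V-style chatbot could be queried with an ocular image and questions mirroring the study's categories. It assumes the OpenAI Python SDK (openai >= 1.0), the gpt-4-vision-preview model, and an API key in the OPENAI_API_KEY environment variable; the file name and question wording are hypothetical, not the authors' actual prompts.

```python
# Minimal sketch of querying a GPT-4V-style multimodal model with an ocular image.
# Assumes the OpenAI Python SDK (openai >= 1.0) and OPENAI_API_KEY in the environment;
# prompts and model name are illustrative, not the authors' published configuration.
import base64
from openai import OpenAI

client = OpenAI()

def ask_about_image(image_path: str, question: str) -> str:
    """Send one question plus one ocular image to a GPT-4V-style model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # multimodal GPT-4V endpoint (assumed)
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }
        ],
        max_tokens=500,
    )
    return response.choices[0].message.content

# Hypothetical questions mirroring the evaluation categories:
questions = [
    "What imaging examination is shown in this picture?",        # examination identification
    "What lesions or abnormal findings can you see?",            # lesion detection
    "What is the most likely diagnosis?",                        # diagnosis
    "What further management or referral would you recommend?",  # decision support
]
for q in questions:
    print(ask_about_image("slit_lamp_example.jpg", q))
```

Repeatability, as reported in the abstract, could then be assessed by issuing the same diagnostic question twice per image and comparing the two answers.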
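Likewise, the agreement and proportion comparisons reported in the Results could be computed as sketched below. The agreement statistic is assumed here to be Fleiss' kappa (the abstract reports only "substantial agreement" among three graders), and the counts are made-up placeholders chosen to roughly match the reported proportions, not the study's raw data.

```python
# Illustrative sketch of the reported statistics: inter-rater agreement among three
# graders (assumed Fleiss' kappa) and a chi-square comparison of proportions between
# common and rare diseases. All counts below are hypothetical placeholders.
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Each row is one response, each column one ophthalmologist;
# values are the assigned rating category (e.g., 0 = wrong, 1 = correct).
ratings = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
])
table, _ = aggregate_raters(ratings)     # subjects x categories count table
print("Fleiss' kappa:", fleiss_kappa(table))

# 2x2 contingency table: correct vs incorrect responses for common vs rare
# disease images (made-up counts roughly matching 37.9% vs 23.2%).
contingency = np.array([
    [114, 186],   # common diseases: correct, incorrect
    [70, 230],    # rare diseases:   correct, incorrect
])
chi2, p, dof, _ = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```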