Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy.

IF 3.7 | CAS Tier 2 (Medicine) | JCR Q1 (Ophthalmology)
Kai Xiong Cheong, Chenxi Zhang, Tien-En Tan, Beau J Fenner, Wendy Meihua Wong, Kelvin YC Teo, Ya Xing Wang, Sobha Sivaprasad, Pearse A Keane, Cecilia Sungmin Lee, Aaron Y Lee, Chui Ming Gemmy Cheung, Tien Yin Wong, Yun-Gyung Cheong, Su Jeong Song, Yih Chung Tham
{"title":"Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy.","authors":"Kai Xiong Cheong, Chenxi Zhang, Tien-En Tan, Beau J Fenner, Wendy Meihua Wong, Kelvin Yc Teo, Ya Xing Wang, Sobha Sivaprasad, Pearse A Keane, Cecilia Sungmin Lee, Aaron Y Lee, Chui Ming Gemmy Cheung, Tien Yin Wong, Yun-Gyung Cheong, Su Jeong Song, Yih Chung Tham","doi":"10.1136/bjo-2023-324533","DOIUrl":null,"url":null,"abstract":"<p><strong>Background/aims: </strong>To compare the performance of generative versus retrieval-based chatbots in answering patient inquiries regarding age-related macular degeneration (AMD) and diabetic retinopathy (DR).</p><p><strong>Methods: </strong>We evaluated four chatbots: generative models (ChatGPT-4, ChatGPT-3.5 and Google Bard) and a retrieval-based model (OcularBERT) in a cross-sectional study. Their response accuracy to 45 questions (15 AMD, 15 DR and 15 others) was evaluated and compared. Three masked retinal specialists graded the responses using a three-point Likert scale: either 2 (good, error-free), 1 (borderline) or 0 (poor with significant inaccuracies). The scores were aggregated, ranging from 0 to 6. Based on majority consensus among the graders, the responses were also classified as 'Good', 'Borderline' or 'Poor' quality.</p><p><strong>Results: </strong>Overall, ChatGPT-4 and ChatGPT-3.5 outperformed the other chatbots, both achieving median scores (IQR) of 6 (1), compared with 4.5 (2) in Google Bard, and 2 (1) in OcularBERT (all p ≤8.4×10<sup>-3</sup>). Based on the consensus approach, 83.3% of ChatGPT-4's responses and 86.7% of ChatGPT-3.5's were rated as 'Good', surpassing Google Bard (50%) and OcularBERT (10%) (all p ≤1.4×10<sup>-2</sup>). ChatGPT-4 and ChatGPT-3.5 had no 'Poor' rated responses. Google Bard produced 6.7% Poor responses, and OcularBERT produced 20%. Across question types, ChatGPT-4 outperformed Google Bard only for AMD, and ChatGPT-3.5 outperformed Google Bard for DR and others.</p><p><strong>Conclusion: </strong>ChatGPT-4 and ChatGPT-3.5 demonstrated superior performance, followed by Google Bard and OcularBERT. Generative chatbots are potentially capable of answering domain-specific questions outside their original training. Further validation studies are still required prior to real-world implementation.</p>","PeriodicalId":9313,"journal":{"name":"British Journal of Ophthalmology","volume":null,"pages":null},"PeriodicalIF":3.7000,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"British Journal of Ophthalmology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1136/bjo-2023-324533","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Background/aims: To compare the performance of generative versus retrieval-based chatbots in answering patient inquiries regarding age-related macular degeneration (AMD) and diabetic retinopathy (DR).

Methods: In this cross-sectional study, we evaluated four chatbots: three generative models (ChatGPT-4, ChatGPT-3.5 and Google Bard) and one retrieval-based model (OcularBERT). The accuracy of their responses to 45 questions (15 on AMD, 15 on DR and 15 on other topics) was evaluated and compared. Three masked retinal specialists graded each response on a three-point Likert scale: 2 (good, error-free), 1 (borderline) or 0 (poor, with significant inaccuracies). The three grades were summed to give an aggregate score from 0 to 6. Based on majority consensus among the graders, each response was also classified as 'Good', 'Borderline' or 'Poor' quality.
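For clarity, here is a minimal Python sketch of the scoring scheme just described. The function name is illustrative, and the handling of a three-way grader split is an assumption; the abstract specifies only the 0-2 scale, the 0-6 sum, and majority consensus.

```python
from collections import Counter

LABELS = {2: "Good", 1: "Borderline", 0: "Poor"}

def aggregate(grades):
    """Sum three 0-2 grades to a 0-6 score and take the majority label."""
    assert len(grades) == 3 and all(g in LABELS for g in grades)
    total = sum(grades)                             # aggregated score, 0-6
    top, count = Counter(grades).most_common(1)[0]  # most frequent grade
    # Majority consensus: at least two of three graders agree. How a
    # three-way split (2, 1, 0) was resolved is not stated in the abstract.
    label = LABELS[top] if count >= 2 else "No consensus"
    return total, label

print(aggregate([2, 2, 1]))  # -> (5, 'Good')
print(aggregate([1, 0, 0]))  # -> (1, 'Poor')
```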

Results: Overall, ChatGPT-4 and ChatGPT-3.5 outperformed the other chatbots, both achieving median scores (IQR) of 6 (1), compared with 4.5 (2) for Google Bard and 2 (1) for OcularBERT (all p ≤ 8.4×10⁻³). Under the consensus approach, 83.3% of ChatGPT-4's responses and 86.7% of ChatGPT-3.5's were rated 'Good', surpassing Google Bard (50%) and OcularBERT (10%) (all p ≤ 1.4×10⁻²). ChatGPT-4 and ChatGPT-3.5 had no responses rated 'Poor', whereas 6.7% of Google Bard's responses and 20% of OcularBERT's were rated 'Poor'. Across question types, ChatGPT-4 outperformed Google Bard only for AMD questions, while ChatGPT-3.5 outperformed Google Bard for DR and other questions.
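As a hedged illustration of how such summary statistics are computed: the abstract reports median (IQR) scores and p-values but does not name the statistical test, so the sketch below assumes a Mann-Whitney U test (a common choice for ordinal scores) and uses made-up placeholder scores, not the study's data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Placeholder 0-6 aggregate scores, NOT the study's data.
model_a = np.array([6, 6, 5, 6, 4, 6, 5, 6, 6, 5])
model_b = np.array([5, 4, 4, 5, 3, 5, 4, 6, 4, 3])

def median_iqr(scores):
    """Report 'median (IQR width)' in the style used by the abstract."""
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    return f"{med:g} ({q3 - q1:g})"

print("model A:", median_iqr(model_a))
print("model B:", median_iqr(model_b))
# Two-sided Mann-Whitney U test on the score distributions (assumed test).
print("p =", mannwhitneyu(model_a, model_b).pvalue)
```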

Conclusion: ChatGPT-4 and ChatGPT-3.5 demonstrated superior performance, followed by Google Bard and OcularBERT. Generative chatbots are potentially capable of answering domain-specific questions outside their original training. Further validation studies are still required prior to real-world implementation.

Source journal: British Journal of Ophthalmology
CiteScore: 10.30
Self-citation rate: 2.40%
Annual publication volume: 213
Review turnaround: 3-6 weeks
Journal description: The British Journal of Ophthalmology (BJO) is an international peer-reviewed journal for ophthalmologists and visual science specialists. BJO publishes clinical investigations, clinical observations, and clinically relevant laboratory investigations related to ophthalmology. It also publishes major reviews and manuscripts covering regional issues in a global context.