Reliability and readability analysis of ChatGPT-4 and Google Bard as a patient information source for the most commonly applied radionuclide treatments in cancer patients

{"title":"Reliability and readability analysis of ChatGPT-4 and Google Bard as a patient information source for the most commonly applied radionuclide treatments in cancer patients","authors":"","doi":"10.1016/j.remnie.2024.500021","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><p>Searching for online health information is a popular approach employed by patients to enhance their knowledge for their diseases. Recently developed AI chatbots are probably the easiest way in this regard. The purpose of the study is to analyze the reliability and readability of AI chatbot responses in terms of the most commonly applied radionuclide treatments in cancer patients.</p></div><div><h3>Methods</h3><p>Basic patient questions, thirty about RAI, PRRT and TARE treatments and twenty-nine about PSMA-TRT, were asked one by one to GPT-4 and Bard on January 2024. The reliability and readability of the responses were assessed by using DISCERN scale, Flesch Reading Ease(FRE) and Flesch-Kincaid Reading Grade Level(FKRGL).</p></div><div><h3>Results</h3><p><span><span>The mean (SD) FKRGL scores for the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT and TARE treatmens were 14.57 (1.19), 14.65 (1.38), 14.25 (1.10), 14.38 (1.2) and 11.49 (1.59), 12.42 (1.71), 11.35 (1.80), 13.01 (1.97), respectively. In terms of readability the FRKGL scores of the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT and TARE treatments were above the general public reading grade level. The mean (SD) DISCERN scores assesses by nuclear medicine phsician for the responses of GPT-4 and Bard about RAI, PSMA-TRT, PRRT and TARE treatments were 47.86 (5.09), 48.48 (4.22), 46.76 (4.09), 48.33 (5.15) and 51.50 (5.64), 53.44 (5.42), 53 (6.36), 49.43 (5.32), respectively. Based on mean DISCERN scores, the reliability of the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT, and TARE treatments ranged from fair to good. The inter-rater reliability correlation coefficient of DISCERN scores assessed by GPT-4, Bard and </span>nuclear medicine physician for the responses of GPT-4 about RAI, PSMA-TRT, PRRT and TARE treatments were 0.512(95% CI 0.296: 0.704), 0.695(95% CI 0.518: 0.829), 0.687(95% CI 0.511: 0.823) and 0.649 (95% CI 0.462: 0.798), respectively (</span><em>p</em> &lt; 0.01). The inter-rater reliability correlation coefficient of DISCERN scores assessed by GPT-4, Bard and nuclear medicine physician for the responses of Bard about RAI, PSMA-TRT, PRRT and TARE treatments were 0.753(95% CI 0.602: 0.863), 0.812(95% CI 0.686: 0.899), 0.804(95% CI 0.677: 0.894) and 0.671 (95% CI 0.489: 0.812), respectively (<em>p</em> &lt; 0.01). The inter-rater reliability for the responses of Bard and GPT-4 about RAİ, PSMA-TRT, PRRT and TARE treatments were moderate to good. Further, consulting to the nuclear medicine physician was rarely emphasized both in GPT-4 and Google Bard and references were included in some responses of Google Bard, but there were no references in GPT-4.</p></div><div><h3>Conclusion</h3><p>Although the information provided by AI chatbots may be acceptable in medical terms, it can not be easy to read for the general public, which may prevent it from being understandable. Effective prompts using 'prompt engineering' may refine the responses in a more comprehensible manner. Since radionuclide treatments are specific to nuclear medicine expertise, nuclear medicine physician need to be stated as a consultant in responses in order to guide patients and caregivers to obtain accurate medical advice. Referencing is significant in terms of confidence and satisfaction of patients and caregivers seeking information.</p></div>","PeriodicalId":94197,"journal":{"name":"Revista espanola de medicina nuclear e imagen molecular","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Revista espanola de medicina nuclear e imagen molecular","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2253808924000375","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Purpose

Searching for online health information is a popular approach employed by patients to enhance their knowledge for their diseases. Recently developed AI chatbots are probably the easiest way in this regard. The purpose of the study is to analyze the reliability and readability of AI chatbot responses in terms of the most commonly applied radionuclide treatments in cancer patients.

Methods

Basic patient questions, thirty about RAI, PRRT and TARE treatments and twenty-nine about PSMA-TRT, were asked one by one to GPT-4 and Bard on January 2024. The reliability and readability of the responses were assessed by using DISCERN scale, Flesch Reading Ease(FRE) and Flesch-Kincaid Reading Grade Level(FKRGL).

Results

The mean (SD) FKRGL scores for the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT and TARE treatmens were 14.57 (1.19), 14.65 (1.38), 14.25 (1.10), 14.38 (1.2) and 11.49 (1.59), 12.42 (1.71), 11.35 (1.80), 13.01 (1.97), respectively. In terms of readability the FRKGL scores of the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT and TARE treatments were above the general public reading grade level. The mean (SD) DISCERN scores assesses by nuclear medicine phsician for the responses of GPT-4 and Bard about RAI, PSMA-TRT, PRRT and TARE treatments were 47.86 (5.09), 48.48 (4.22), 46.76 (4.09), 48.33 (5.15) and 51.50 (5.64), 53.44 (5.42), 53 (6.36), 49.43 (5.32), respectively. Based on mean DISCERN scores, the reliability of the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT, and TARE treatments ranged from fair to good. The inter-rater reliability correlation coefficient of DISCERN scores assessed by GPT-4, Bard and nuclear medicine physician for the responses of GPT-4 about RAI, PSMA-TRT, PRRT and TARE treatments were 0.512(95% CI 0.296: 0.704), 0.695(95% CI 0.518: 0.829), 0.687(95% CI 0.511: 0.823) and 0.649 (95% CI 0.462: 0.798), respectively (p < 0.01). The inter-rater reliability correlation coefficient of DISCERN scores assessed by GPT-4, Bard and nuclear medicine physician for the responses of Bard about RAI, PSMA-TRT, PRRT and TARE treatments were 0.753(95% CI 0.602: 0.863), 0.812(95% CI 0.686: 0.899), 0.804(95% CI 0.677: 0.894) and 0.671 (95% CI 0.489: 0.812), respectively (p < 0.01). The inter-rater reliability for the responses of Bard and GPT-4 about RAİ, PSMA-TRT, PRRT and TARE treatments were moderate to good. Further, consulting to the nuclear medicine physician was rarely emphasized both in GPT-4 and Google Bard and references were included in some responses of Google Bard, but there were no references in GPT-4.

Conclusion

Although the information provided by AI chatbots may be acceptable in medical terms, it can not be easy to read for the general public, which may prevent it from being understandable. Effective prompts using 'prompt engineering' may refine the responses in a more comprehensible manner. Since radionuclide treatments are specific to nuclear medicine expertise, nuclear medicine physician need to be stated as a consultant in responses in order to guide patients and caregivers to obtain accurate medical advice. Referencing is significant in terms of confidence and satisfaction of patients and caregivers seeking information.

GPT-4 和 google bard 作为癌症患者最常用放射性核素治疗的患者信息来源的可靠性和可读性分析。
目的:搜索在线健康信息是患者常用的一种方法,以增强他们对疾病的了解。最近开发的人工智能聊天机器人可能是这方面最简单的方法。本研究的目的是分析人工智能聊天机器人在癌症患者最常用的放射性核素治疗方面的回答的可靠性和可读性:方法:在 2024 年 1 月向 GPT-4 和 Bard 逐一询问了患者的基本问题,其中 30 个是关于 RAI、PRRT 和 TARE 治疗的,29 个是关于 PSMA-TRT 的。采用 DISCERN 量表、Flesch Reading Ease(FRE)和 Flesch-Kincaid Reading Grade Level(FKRGL)对回答的可靠性和可读性进行了评估:GPT-4和谷歌巴德关于RAI、PSMA-TRT、PRRT和TARE治疗的平均(标清)FKRGL分数分别为14.57(1.19)、14.65(1.38)、14.25(1.10)、14.38(1.2)和11.49(1.59)、12.42(1.71)、11.35(1.80)、13.01(1.97)。就可读性而言,关于 RAI、PSMA-TRT、PRRT 和 TARE 治疗的 GPT-4 和 Google Bard 的 FRKGL 分数高于一般公众的阅读水平。核医学医生对 GPT-4 和谷歌巴德关于 RAI、PSMA-TRT、PRRT 和 TARE 治疗的回答进行评估后得出的 DISCERN 平均分(标度)分别为 47.86(5.09)、48.48(4.22)、46.76(4.09)、48.33(5.15)和 51.50(5.64)、53.44(5.42)、53(6.36)、49.43(5.32)。根据 DISCERN 平均得分,GPT-4 和 Google Bard 关于 RAI、PSMA-TRT、PRRT 和 TARE 治疗的回答的可靠性从一般到良好不等。由 GPT-4、谷歌巴德和核医学医生对 GPT-4 关于 RAI、PSMA-TRT、PRRT 和 TARE 治疗的回答所评估的 DISCERN 分数的评分者间可靠性相关系数分别为 0.512(95% CI 0.296:0.704)、0.695(95% CI 0.518:0.829)、0.687(95% CI 0.511:0.823)和 0.649(95% CI 0.462:0.798)(P 结论:GPT-4、谷歌巴德和核医学医生对 GPT-4 的回答所评估的 DISCERN 分数的评分者间可靠性相关系数分别为 0.512(95% CI 0.296:0.704)、0.695(95% CI 0.518:0.829)、0.687(95% CI 0.511:0.823)和 0.649(95% CI 0.462:0.798):虽然人工智能聊天机器人提供的信息在医学术语上可以接受,但对于普通大众来说却不容易阅读,这可能会妨碍其理解。使用 "提示工程 "进行有效提示,可以以更易于理解的方式完善回复。由于放射性核素治疗是核医学专业知识的特定内容,因此需要在回答中说明核医学医生是顾问,以指导患者和护理人员获得准确的医疗建议。就患者和护理人员寻求信息的信心和满意度而言,参考意义重大。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信