Reliability and readability analysis of ChatGPT-4 and Google Bard as a patient information source for the most commonly applied radionuclide treatments in cancer patients

Revista espanola de medicina nuclear e imagen molecular, Volume 43, Issue 4, Article 500021. Pub Date: 2024-07-01. DOI: 10.1016/j.remnie.2024.500021
URL: https://www.sciencedirect.com/science/article/pii/S2253808924000375


Abstract

Purpose

Searching for health information online is a popular way for patients to improve their knowledge of their diseases, and recently developed AI chatbots are probably the easiest way to do so. The purpose of this study is to analyze the reliability and readability of AI chatbot responses about the most commonly applied radionuclide treatments in cancer patients.

Methods

Basic patient questions, thirty about RAI, PRRT and TARE treatments and twenty-nine about PSMA-TRT, were posed one by one to GPT-4 and Bard in January 2024. The reliability and readability of the responses were assessed using the DISCERN scale, Flesch Reading Ease (FRE) and Flesch-Kincaid Reading Grade Level (FKRGL).
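The two readability measures named in the Methods are standard formulas over word, sentence and syllable counts. The sketch below shows how they are computed; it is an illustration, not the tooling used in the study, which the abstract does not name. The function names and the vowel-group syllable counter are assumptions made for this example, and the syllable heuristic is deliberately crude, so scores from it will differ somewhat from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (an assumption for this sketch): one syllable per
    group of consecutive vowels, with a minimum of one per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_counts(text: str) -> tuple[int, int, int]:
    """Return (words, sentences, syllables) for a piece of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    # Hypothetical chatbot-style answer, used only to exercise the functions.
    sample = (
        "Radioactive iodine is taken as a capsule or a liquid. "
        "Your care team will explain the safety precautions."
    )
    w, s, syl = text_counts(sample)
    print(f"FRE   = {flesch_reading_ease(w, s, syl):.2f}")
    print(f"FKRGL = {flesch_kincaid_grade(w, s, syl):.2f}")
```

Both formulas penalize long sentences and long words, so a response can be medically sound and still score at a college reading level.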

Results

The mean (SD) FKRGL scores for the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT and TARE treatments were 14.57 (1.19), 14.65 (1.38), 14.25 (1.10), 14.38 (1.2) and 11.49 (1.59), 12.42 (1.71), 11.35 (1.80), 13.01 (1.97), respectively. In terms of readability, the FKRGL scores of the responses of both GPT-4 and Google Bard for all four treatments were above the reading grade level of the general public.

The mean (SD) DISCERN scores assigned by a nuclear medicine physician to the responses of GPT-4 and Bard about RAI, PSMA-TRT, PRRT and TARE treatments were 47.86 (5.09), 48.48 (4.22), 46.76 (4.09), 48.33 (5.15) and 51.50 (5.64), 53.44 (5.42), 53 (6.36), 49.43 (5.32), respectively. Based on mean DISCERN scores, the reliability of the responses of GPT-4 and Google Bard about these treatments ranged from fair to good.

The inter-rater reliability coefficients of the DISCERN scores assessed by GPT-4, Bard and the nuclear medicine physician for the responses of GPT-4 about RAI, PSMA-TRT, PRRT and TARE treatments were 0.512 (95% CI 0.296–0.704), 0.695 (95% CI 0.518–0.829), 0.687 (95% CI 0.511–0.823) and 0.649 (95% CI 0.462–0.798), respectively (p < 0.01). The corresponding coefficients for the responses of Bard were 0.753 (95% CI 0.602–0.863), 0.812 (95% CI 0.686–0.899), 0.804 (95% CI 0.677–0.894) and 0.671 (95% CI 0.489–0.812), respectively (p < 0.01). Inter-rater reliability for the responses of both Bard and GPT-4 about RAI, PSMA-TRT, PRRT and TARE treatments was moderate to good. Further, consulting a nuclear medicine physician was rarely emphasized by either GPT-4 or Google Bard; some responses of Google Bard included references, whereas none of the GPT-4 responses did.
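The abstract summarizes the DISCERN means as "fair to good" and the inter-rater coefficients as "moderate to good" without stating the cut-offs behind those labels. The sketch below maps the reported values onto commonly used bands. The thresholds are assumptions, not values taken from the paper: a five-band split of the 16 to 80 DISCERN total, and conventional bands for an intraclass-style reliability coefficient. All function and variable names are illustrative.

```python
# Assumed cut-offs (NOT stated in the abstract): lower bound of each band.
DISCERN_BANDS = [
    (63, "excellent"),
    (51, "good"),
    (39, "fair"),
    (27, "poor"),
    (16, "very poor"),
]
RELIABILITY_BANDS = [
    (0.90, "excellent"),
    (0.75, "good"),
    (0.50, "moderate"),
    (float("-inf"), "poor"),
]


def band(value: float, bands: list[tuple[float, str]]) -> str:
    """Return the label of the first band whose lower bound is <= value."""
    for lower, label in bands:
        if value >= lower:
            return label
    raise ValueError(f"{value} is below every band")


def discern_band(total: float) -> str:
    """Label a DISCERN total (16 questions scored 1-5, so 16 to 80)."""
    if not 16 <= total <= 80:
        raise ValueError("DISCERN totals range from 16 to 80")
    return band(total, DISCERN_BANDS)


def reliability_band(coefficient: float) -> str:
    """Label an inter-rater reliability coefficient."""
    return band(coefficient, RELIABILITY_BANDS)


# Mean DISCERN scores and reliability coefficients as reported in the abstract.
TREATMENTS = ["RAI", "PSMA-TRT", "PRRT", "TARE"]
DISCERN_MEANS = {
    "GPT-4": [47.86, 48.48, 46.76, 48.33],
    "Bard": [51.50, 53.44, 53.0, 49.43],
}
RELIABILITY = {
    "GPT-4": [0.512, 0.695, 0.687, 0.649],
    "Bard": [0.753, 0.812, 0.804, 0.671],
}

if __name__ == "__main__":
    for model in DISCERN_MEANS:
        for name, mean, coef in zip(
            TREATMENTS, DISCERN_MEANS[model], RELIABILITY[model]
        ):
            print(
                f"{model:5s} {name:8s} DISCERN {mean:5.2f} -> {discern_band(mean):9s}"
                f" reliability {coef:.3f} -> {reliability_band(coef)}"
            )
```

Under these assumed cut-offs every reported DISCERN mean falls in the fair or good band and every coefficient in the moderate or good band, which is consistent with the summary wording in the Results.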

Conclusion

Although the information provided by AI chatbots may be acceptable in medical terms, it may not be easy for the general public to read, which can prevent it from being understood. Effective prompts written with 'prompt engineering' may make the responses more comprehensible. Since radionuclide treatments are specific to nuclear medicine expertise, responses need to name the nuclear medicine physician as the consultant in order to guide patients and caregivers toward accurate medical advice. Referencing is important for the confidence and satisfaction of patients and caregivers seeking information.
