Evaluation of the reliability and readability of chatbot responses as a patient information resource for the most common PET/CT scans

Impact factor 1.6; Zone 4 (Medicine); JCR Q3, Radiology, Nuclear Medicine & Medical Imaging

Revista Espanola De Medicina Nuclear E Imagen Molecular. Pub Date: 2025-01-01. DOI: 10.1016/j.remn.2024.500065

N. Aydinbelge-Dizdar, K. Dizdar

Volume 44, issue 1, Article 500065. Full text: https://www.sciencedirect.com/science/article/pii/S2253654X24000969

Abstract

Purpose

This study aimed to evaluate the reliability and readability of responses generated by two popular AI-chatbots, ‘ChatGPT-4.0’ and ‘Google Gemini’, to potential patient questions about PET/CT scans.

Materials and methods

Thirty potential questions for each of [¹⁸F]FDG and [⁶⁸Ga]Ga-DOTA-SSTR PET/CT, and twenty-nine potential questions for [⁶⁸Ga]Ga-PSMA PET/CT, were posed separately to ChatGPT-4 and Gemini in May 2024. The responses were evaluated for reliability and readability using the modified DISCERN (mDISCERN) scale, Flesch Reading Ease (FRE), Gunning Fog Index (GFI), and Flesch-Kincaid Reading Grade Level (FKRGL). The inter-rater reliability of the mDISCERN scores assigned to the responses by three raters (ChatGPT-4, Gemini, and a nuclear medicine physician) was assessed.
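As a rough illustration of how the three readability indices are computed, the following Python sketch applies the standard published formulas for FRE, FKRGL, and GFI. The tokenizer and the vowel-group syllable counter are simplifying assumptions of this sketch, not the tool the authors used, so its scores will differ somewhat from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def readability(text: str) -> dict:
    """Return FRE, FKRGL and GFI for an English text.

    Sentences are split on ., ! and ?, so decimals such as "3.5" are
    miscounted as sentence breaks; this is a known limit of the sketch.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = [count_syllables(w) for w in words]
    complex_words = sum(1 for s in syllables if s >= 3)  # 3+ syllables

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(syllables) / len(words)

    return {
        # Higher FRE means easier text.
        "FRE": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        # FKRGL and GFI approximate the school grade needed to read the text.
        "FKRGL": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "GFI": 0.4 * (words_per_sentence + 100 * complex_words / len(words)),
    }


if __name__ == "__main__":
    sample = "A PET scan shows how tissues use sugar. You lie still during the scan."
    for name, value in readability(sample).items():
        print(f"{name}: {value:.1f}")
```

A higher FRE indicates easier text, whereas higher FKRGL and GFI values indicate harder text, which is why a response can score "better" on all three only if FRE goes up while the other two go down.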

Results

The median [min-max] mDISCERN scores assigned by the physician to responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 3.5 [2-4], 3 [3-4], and 3 [3-4] for ChatGPT-4 and 4 [2-5], 4 [2-5], and 3.5 [3-5] for Gemini, respectively. The mDISCERN scores assigned by ChatGPT-4 to responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 3.5 [3-5], 3 [3-4], and 3 [2-3] for ChatGPT-4 and 4 [3-5], 4 [3-5], and 4 [3-5] for Gemini, respectively. The mDISCERN scores assigned by Gemini to responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 3 [2-4], 2 [2-4], and 3 [2-4] for ChatGPT-4 and 3 [2-5], 3 [1-5], and 3 [2-5] for Gemini, respectively. The inter-rater reliability correlation coefficients of the mDISCERN scores for ChatGPT-4 responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 0.629 (95% CI = 0.32-0.812), 0.707 (95% CI = 0.458-0.853), and 0.738 (95% CI = 0.519-0.866), respectively (p < 0.001). The corresponding coefficients for Gemini responses about FDG, PSMA, and DOTA-SSTR PET/CT scans were 0.824 (95% CI = 0.677-0.910), 0.881 (95% CI = 0.78-0.94), and 0.847 (95% CI = 0.719-0.922), respectively (p < 0.001). According to the inter-rater reliability correlation coefficient, the mDISCERN scores assigned by ChatGPT-4, Gemini, and the physician showed moderate to good agreement for the chatbots' responses about all PET/CT scans (p < 0.001). All readability scores (FKRGL, GFI, and FRE) differed significantly between ChatGPT-4 and Gemini responses about PET/CT scans (p < 0.001). Gemini responses were shorter and had better readability scores than ChatGPT-4 responses.
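The abstract does not state which form of the inter-rater coefficient was computed. As an illustration only, the sketch below assumes a two-way random-effects, absolute-agreement, single-rater intraclass correlation, ICC(2,1) in the Shrout and Fleiss notation. The function name `icc2_1` and the example ratings are hypothetical and are not the study's data.

```python
def icc2_1(ratings):
    """ICC(2,1) for a table with one row per rated item and one column per rater.

    Assumption: two-way random effects, absolute agreement, single rater.
    The study may have used a different ICC form.
    """
    n = len(ratings)       # number of rated items (e.g. chatbot responses)
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-item mean square
    ms_cols = ss_cols / (k - 1)                 # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


if __name__ == "__main__":
    # Hypothetical mDISCERN scores: rows are responses, columns are three raters.
    example = [[3, 4, 3], [4, 4, 3], [2, 3, 2], [5, 5, 4], [3, 3, 3]]
    print(round(icc2_1(example), 3))
```

Because the absolute-agreement form penalizes a rater who scores systematically higher or lower than the others, it suits a setting like this one, where a physician and two chatbots rate the same responses on the same 1 to 5 scale.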

Conclusion

There was an acceptable level of agreement between raters for the mDISCERN score, indicating that the raters concurred on the overall reliability of the responses. However, the information provided by the AI chatbots is not easy for the general public to read.

Source journal

Revista Espanola De Medicina Nuclear E Imagen Molecular (Radiology, Nuclear Medicine & Medical Imaging)

CiteScore: 1.10

Self-citation rate: 16.70%

Review time: 24 days

Journal description: The Revista Española de Medicina Nuclear e Imagen Molecular (Spanish Journal of Nuclear Medicine and Molecular Imaging) was founded in 1982 and is the official journal of the Spanish Society of Nuclear Medicine and Molecular Imaging, which has more than 700 members. The Journal, which publishes 6 regular issues per year, has as its main aim the promotion of research and continuing education in all fields of Nuclear Medicine. To that end, its principal sections are Originals, Clinical Notes, Images of Interest, and Special Collaboration articles.