Assessing the Quality and Reliability of ChatGPT's Responses to Radiotherapy-Related Patient Queries: Comparative Study With GPT-3.5 and GPT-4

IF 2.7 Q2 ONCOLOGY
JMIR Cancer · Pub Date: 2025-04-16 · DOI: 10.2196/63677
Ana Grilo, Catarina Marques, Maria Corte-Real, Elisabete Carolino, Marco Caetano
{"title":"评估ChatGPT对放疗相关患者询问反应的质量和可靠性:与GPT-3.5和GPT-4的比较研究","authors":"Ana Grilo, Catarina Marques, Maria Corte-Real, Elisabete Carolino, Marco Caetano","doi":"10.2196/63677","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Patients frequently resort to the internet to access information about cancer. However, these websites often lack content accuracy and readability. Recently, ChatGPT, an artificial intelligence-powered chatbot, has signified a potential paradigm shift in how patients with cancer can access vast amounts of medical information, including insights into radiotherapy. However, the quality of the information provided by ChatGPT remains unclear. This is particularly significant given the general public's limited knowledge of this treatment and concerns about its possible side effects. Furthermore, evaluating the quality of responses is crucial, as misinformation can foster a false sense of knowledge and security, lead to noncompliance, and result in delays in receiving appropriate treatment.</p><p><strong>Objective: </strong>This study aims to evaluate the quality and reliability of ChatGPT's responses to common patient queries about radiotherapy, comparing the performance of ChatGPT's two versions: GPT-3.5 and GPT-4.</p><p><strong>Methods: </strong>We selected 40 commonly asked radiotherapy questions and entered the queries in both versions of ChatGPT. Response quality and reliability were evaluated by 16 radiotherapy experts using the General Quality Score (GQS), a 5-point Likert scale, with the median GQS determined based on the experts' ratings. Consistency and similarity of responses were assessed using the cosine similarity score, which ranges from 0 (complete dissimilarity) to 1 (complete similarity). Readability was analyzed using the Flesch Reading Ease Score, ranging from 0 to 100, and the Flesch-Kincaid Grade Level, reflecting the average number of years of education required for comprehension. Statistical analyses were performed using the Mann-Whitney test and effect size, with results deemed significant at a 5% level (P=.05). To assess agreement between experts, Krippendorff α and Fleiss κ were used.</p><p><strong>Results: </strong>GPT-4 demonstrated superior performance, with a higher GQS and a lower number of scores of 1 and 2, compared to GPT-3.5. The Mann-Whitney test revealed statistically significant differences in some questions, with GPT-4 generally receiving higher ratings. The median (IQR) cosine similarity score indicated substantial similarity (0.81, IQR 0.05) and consistency in the responses of both versions (GPT-3.5: 0.85, IQR 0.04; GPT-4: 0.83, IQR 0.04). Readability scores for both versions were considered college level, with GPT-4 scoring slightly better in the Flesch Reading Ease Score (34.61) and Flesch-Kincaid Grade Level (12.32) compared to GPT-3.5 (32.98 and 13.32, respectively). Responses by both versions were deemed challenging for the general public.</p><p><strong>Conclusions: </strong>Both GPT-3.5 and GPT-4 demonstrated having the capability to address radiotherapy concepts, with GPT-4 showing superior performance. However, both models present readability challenges for the general population. Although ChatGPT demonstrates potential as a valuable resource for addressing common patient queries related to radiotherapy, it is imperative to acknowledge its limitations, including the risks of misinformation and readability issues. 
In addition, its implementation should be supported by strategies to enhance accessibility and readability.</p>","PeriodicalId":45538,"journal":{"name":"JMIR Cancer","volume":"11 ","pages":"e63677"},"PeriodicalIF":2.7000,"publicationDate":"2025-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12017613/pdf/","citationCount":"0","resultStr":"{\"title\":\"Assessing the Quality and Reliability of ChatGPT's Responses to Radiotherapy-Related Patient Queries: Comparative Study With GPT-3.5 and GPT-4.\",\"authors\":\"Ana Grilo, Catarina Marques, Maria Corte-Real, Elisabete Carolino, Marco Caetano\",\"doi\":\"10.2196/63677\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Patients frequently resort to the internet to access information about cancer. However, these websites often lack content accuracy and readability. Recently, ChatGPT, an artificial intelligence-powered chatbot, has signified a potential paradigm shift in how patients with cancer can access vast amounts of medical information, including insights into radiotherapy. However, the quality of the information provided by ChatGPT remains unclear. This is particularly significant given the general public's limited knowledge of this treatment and concerns about its possible side effects. Furthermore, evaluating the quality of responses is crucial, as misinformation can foster a false sense of knowledge and security, lead to noncompliance, and result in delays in receiving appropriate treatment.</p><p><strong>Objective: </strong>This study aims to evaluate the quality and reliability of ChatGPT's responses to common patient queries about radiotherapy, comparing the performance of ChatGPT's two versions: GPT-3.5 and GPT-4.</p><p><strong>Methods: </strong>We selected 40 commonly asked radiotherapy questions and entered the queries in both versions of ChatGPT. Response quality and reliability were evaluated by 16 radiotherapy experts using the General Quality Score (GQS), a 5-point Likert scale, with the median GQS determined based on the experts' ratings. Consistency and similarity of responses were assessed using the cosine similarity score, which ranges from 0 (complete dissimilarity) to 1 (complete similarity). Readability was analyzed using the Flesch Reading Ease Score, ranging from 0 to 100, and the Flesch-Kincaid Grade Level, reflecting the average number of years of education required for comprehension. Statistical analyses were performed using the Mann-Whitney test and effect size, with results deemed significant at a 5% level (P=.05). To assess agreement between experts, Krippendorff α and Fleiss κ were used.</p><p><strong>Results: </strong>GPT-4 demonstrated superior performance, with a higher GQS and a lower number of scores of 1 and 2, compared to GPT-3.5. The Mann-Whitney test revealed statistically significant differences in some questions, with GPT-4 generally receiving higher ratings. The median (IQR) cosine similarity score indicated substantial similarity (0.81, IQR 0.05) and consistency in the responses of both versions (GPT-3.5: 0.85, IQR 0.04; GPT-4: 0.83, IQR 0.04). Readability scores for both versions were considered college level, with GPT-4 scoring slightly better in the Flesch Reading Ease Score (34.61) and Flesch-Kincaid Grade Level (12.32) compared to GPT-3.5 (32.98 and 13.32, respectively). 
Responses by both versions were deemed challenging for the general public.</p><p><strong>Conclusions: </strong>Both GPT-3.5 and GPT-4 demonstrated having the capability to address radiotherapy concepts, with GPT-4 showing superior performance. However, both models present readability challenges for the general population. Although ChatGPT demonstrates potential as a valuable resource for addressing common patient queries related to radiotherapy, it is imperative to acknowledge its limitations, including the risks of misinformation and readability issues. In addition, its implementation should be supported by strategies to enhance accessibility and readability.</p>\",\"PeriodicalId\":45538,\"journal\":{\"name\":\"JMIR Cancer\",\"volume\":\"11 \",\"pages\":\"e63677\"},\"PeriodicalIF\":2.7000,\"publicationDate\":\"2025-04-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12017613/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JMIR Cancer\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2196/63677\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ONCOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Cancer","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/63677","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ONCOLOGY","Score":null,"Total":0}
Cited by: 0

Abstract



Background: Patients frequently resort to the internet to access information about cancer. However, these websites often lack content accuracy and readability. Recently, ChatGPT, an artificial intelligence-powered chatbot, has signified a potential paradigm shift in how patients with cancer can access vast amounts of medical information, including insights into radiotherapy. However, the quality of the information provided by ChatGPT remains unclear. This is particularly significant given the general public's limited knowledge of this treatment and concerns about its possible side effects. Furthermore, evaluating the quality of responses is crucial, as misinformation can foster a false sense of knowledge and security, lead to noncompliance, and result in delays in receiving appropriate treatment.

Objective: This study aims to evaluate the quality and reliability of ChatGPT's responses to common patient queries about radiotherapy, comparing the performance of ChatGPT's two versions: GPT-3.5 and GPT-4.

Methods: We selected 40 commonly asked radiotherapy questions and entered them into both versions of ChatGPT. Response quality and reliability were evaluated by 16 radiotherapy experts using the General Quality Score (GQS), a 5-point Likert scale, with the median GQS determined from the experts' ratings. Consistency and similarity of responses were assessed using the cosine similarity score, which ranges from 0 (complete dissimilarity) to 1 (complete similarity). Readability was analyzed using the Flesch Reading Ease Score, ranging from 0 to 100, and the Flesch-Kincaid Grade Level, which reflects the average number of years of education required for comprehension. Statistical analyses were performed using the Mann-Whitney test and effect sizes, with results deemed significant at the 5% level (P<.05). Agreement between experts was assessed with Krippendorff α and Fleiss κ.
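To make the analysis pipeline concrete, the sketch below illustrates the kinds of computations named in the Methods (cosine similarity between responses, Flesch readability, a Mann-Whitney U test on expert ratings, and inter-rater agreement) in Python. The paper does not state which software was used or how responses were vectorized before computing cosine similarity, so the TF-IDF representation and the scikit-learn, textstat, scipy, krippendorff, and statsmodels packages used here are assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' code). TF-IDF vectorization for
# cosine similarity is an assumption; the paper does not specify it.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import mannwhitneyu
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
import krippendorff
import textstat


def response_similarity(answer_a: str, answer_b: str) -> float:
    """Cosine similarity in [0, 1] between two answers to the same question."""
    tfidf = TfidfVectorizer().fit_transform([answer_a, answer_b])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])


def readability(answer: str) -> dict:
    """Flesch Reading Ease (0-100, higher = easier) and Flesch-Kincaid Grade Level."""
    return {
        "flesch_reading_ease": textstat.flesch_reading_ease(answer),
        "flesch_kincaid_grade": textstat.flesch_kincaid_grade(answer),
    }


def compare_expert_ratings(gqs_gpt35: list[int], gqs_gpt4: list[int]) -> float:
    """Two-sided Mann-Whitney U test on the experts' GQS ratings (1-5 Likert)
    for one question; returns the P value (significance threshold .05)."""
    _, p_value = mannwhitneyu(gqs_gpt35, gqs_gpt4, alternative="two-sided")
    return p_value


def expert_agreement(ratings: np.ndarray) -> tuple[float, float]:
    """Krippendorff alpha (ordinal) and Fleiss kappa for a rating matrix of
    shape (n_experts, n_questions)."""
    alpha = krippendorff.alpha(reliability_data=ratings,
                               level_of_measurement="ordinal")
    counts, _ = aggregate_raters(ratings.T)  # questions x rating-category counts
    kappa = fleiss_kappa(counts)
    return alpha, kappa
```

Applied per question across the 40 queries, response_similarity would yield the distribution whose median and IQR are reported in the Results, and compare_expert_ratings would give the per-question P values.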

Results: GPT-4 demonstrated superior performance, with a higher GQS and fewer ratings of 1 and 2 than GPT-3.5. The Mann-Whitney test revealed statistically significant differences for some questions, with GPT-4 generally receiving higher ratings. The median cosine similarity score indicated substantial similarity between the two versions' responses (0.81, IQR 0.05) and consistency within each version (GPT-3.5: 0.85, IQR 0.04; GPT-4: 0.83, IQR 0.04). Readability scores for both versions were at college level, with GPT-4 scoring slightly better on the Flesch Reading Ease Score (34.61) and Flesch-Kincaid Grade Level (12.32) than GPT-3.5 (32.98 and 13.32, respectively). Responses from both versions were deemed challenging for the general public.
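For context on the "college level" interpretation above, the two readability indices are standard formulas based on sentence length and syllable counts; the paper does not restate them, so the following is a reference reminder rather than the authors' derivation:

\[
\text{FRE} = 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}} - 84.6\,\frac{\text{total syllables}}{\text{total words}}
\]
\[
\text{FKGL} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
\]

On the conventional Flesch scale, Reading Ease scores of roughly 30-50 are read as "difficult, college level," so values of 34.61 and 32.98 fall in that band, and a Flesch-Kincaid Grade Level of 12.32 corresponds to material slightly beyond a US 12th-grade reading level.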

Conclusions: Both GPT-3.5 and GPT-4 demonstrated the capability to address radiotherapy concepts, with GPT-4 showing superior performance. However, both models present readability challenges for the general population. Although ChatGPT shows potential as a valuable resource for addressing common patient queries related to radiotherapy, it is imperative to acknowledge its limitations, including the risk of misinformation and readability issues. In addition, its implementation should be supported by strategies to enhance accessibility and readability.

Source journal: JMIR Cancer (Oncology). CiteScore: 4.10; self-citation rate: 0.00%; articles per year: 64; review time: 12 weeks.