ChatGPT v4 outperforming v3.5 on cancer treatment recommendations in quality, clinical guideline, and expert opinion concordance.

IF 2.9 | CAS Tier 3 (Medicine) | JCR Q2, HEALTH CARE SCIENCES & SERVICES
DIGITAL HEALTH · Pub Date: 2024-08-14 · eCollection Date: 2024-01-01 · DOI: 10.1177/20552076241269538
Chung-You Tsai, Pai-Yu Cheng, Juinn-Horng Deng, Fu-Shan Jaw, Shyi-Chun Yii
{"title":"ChatGPT v4 outperforming v3.5 on cancer treatment recommendations in quality, clinical guideline, and expert opinion concordance.","authors":"Chung-You Tsai, Pai-Yu Cheng, Juinn-Horng Deng, Fu-Shan Jaw, Shyi-Chun Yii","doi":"10.1177/20552076241269538","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>To assess the quality and alignment of ChatGPT's cancer treatment recommendations (RECs) with National Comprehensive Cancer Network (NCCN) guidelines and expert opinions.</p><p><strong>Methods: </strong>Three urologists performed quantitative and qualitative assessments in October 2023 analyzing responses from ChatGPT-4 and ChatGPT-3.5 to 108 prostate, kidney, and bladder cancer prompts using two zero-shot prompt templates. Performance evaluation involved calculating five ratios: expert-approved/expert-disagreed and NCCN-aligned RECs against total ChatGPT RECs plus coverage and adherence rates to NCCN. Experts rated the response's quality on a 1-5 scale considering correctness, comprehensiveness, specificity, and appropriateness.</p><p><strong>Results: </strong>ChatGPT-4 outperformed ChatGPT-3.5 in prostate cancer inquiries, with an average word count of 317.3 versus 124.4 (<i>p</i> < 0.001) and 6.1 versus 3.9 RECs (<i>p</i> < 0.001). Its rater-approved REC ratio (96.1% vs. 89.4%) and alignment with NCCN guidelines (76.8% vs. 49.1%, <i>p</i> = 0.001) were superior and scored significantly better on all quality dimensions. Across 108 prompts covering three cancers, ChatGPT-4 produced an average of 6.0 RECs per case, with an 88.5% approval rate from raters, 86.7% NCCN concordance, and only a 9.5% disagreement rate. It achieved high marks in correctness (4.5), comprehensiveness (4.4), specificity (4.0), and appropriateness (4.4). Subgroup analyses across cancer types, disease statuses, and different prompt templates were reported.</p><p><strong>Conclusions: </strong>ChatGPT-4 demonstrated significant improvement in providing accurate and detailed treatment recommendations for urological cancers in line with clinical guidelines and expert opinion. However, it is vital to recognize that AI tools are not without flaws and should be utilized with caution. ChatGPT could supplement, but not replace, personalized advice from healthcare professionals.</p>","PeriodicalId":51333,"journal":{"name":"DIGITAL HEALTH","volume":null,"pages":null},"PeriodicalIF":2.9000,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11325467/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"DIGITAL HEALTH","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/20552076241269538","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 0

Abstract

Objectives: To assess the quality and alignment of ChatGPT's cancer treatment recommendations (RECs) with National Comprehensive Cancer Network (NCCN) guidelines and expert opinions.

Methods: Three urologists performed quantitative and qualitative assessments in October 2023, analyzing responses from ChatGPT-4 and ChatGPT-3.5 to 108 prostate, kidney, and bladder cancer prompts built from two zero-shot prompt templates. Performance evaluation involved calculating five ratios: expert-approved, expert-disagreed, and NCCN-aligned RECs as proportions of all ChatGPT RECs, plus coverage of and adherence to NCCN guideline options. Experts rated each response's quality on a 1-5 scale for correctness, comprehensiveness, specificity, and appropriateness.
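For illustration only, below is a minimal Python sketch of how the five evaluation ratios described above might be computed from per-REC rater annotations. The data layout, field names, and the exact definitions of "coverage" and "adherence" are assumptions, not the authors' analysis code.

```python
# Minimal sketch of the five evaluation ratios (assumed definitions).
# Each REC is annotated by raters; field names are hypothetical.

def evaluate(recs, nccn_options):
    """recs: list of dicts, one per ChatGPT recommendation (REC), e.g.
         {"approved": True, "disagreed": False, "nccn_aligned": True,
          "matched_nccn_option": "radical prostatectomy"}
       nccn_options: set of guideline-listed treatment options for the case."""
    total = len(recs)
    approved = sum(r["approved"] for r in recs)
    disagreed = sum(r["disagreed"] for r in recs)
    aligned = sum(r["nccn_aligned"] for r in recs)
    # Guideline options mentioned at least once by the model.
    covered = {r["matched_nccn_option"] for r in recs if r["nccn_aligned"]}
    return {
        "approval_rate": approved / total,        # expert-approved RECs / total RECs
        "disagreement_rate": disagreed / total,   # expert-disagreed RECs / total RECs
        "nccn_concordance": aligned / total,      # NCCN-aligned RECs / total RECs
        "nccn_coverage": len(covered) / len(nccn_options),
        "nccn_adherence": aligned / total == 1.0, # assumed: all RECs guideline-aligned
    }
```

Here coverage counts how many guideline-listed options the model mentioned at least once, while concordance is computed per REC; the abstract does not spell out these denominators, so both interpretations are assumptions.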

Results: ChatGPT-4 outperformed ChatGPT-3.5 on prostate cancer inquiries, with an average word count of 317.3 versus 124.4 (p < 0.001) and 6.1 versus 3.9 RECs per response (p < 0.001). Its rater-approved REC ratio (96.1% vs. 89.4%) and alignment with NCCN guidelines (76.8% vs. 49.1%, p = 0.001) were superior, and it scored significantly better on all quality dimensions. Across the 108 prompts covering three cancers, ChatGPT-4 produced an average of 6.0 RECs per case, with an 88.5% approval rate from raters, 86.7% NCCN concordance, and only a 9.5% disagreement rate. It achieved high marks in correctness (4.5), comprehensiveness (4.4), specificity (4.0), and appropriateness (4.4). Subgroup analyses across cancer types, disease statuses, and prompt templates were also reported.
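The abstract reports p-values for the word-count and REC-count comparisons but does not name the statistical tests used. Purely as a hedged illustration, the sketch below runs an independent two-sample t-test on synthetic placeholder data (the values and sample sizes are invented for the example and are not study data).

```python
# Hedged illustration: test choice (independent t-test) and data are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic per-prompt word counts, centered on the reported means (317.3 vs 124.4);
# spreads and n are arbitrary placeholders, NOT the study's data.
gpt4_words = rng.normal(317.3, 60.0, 36)
gpt35_words = rng.normal(124.4, 40.0, 36)

t, p = stats.ttest_ind(gpt4_words, gpt35_words)
print(f"word count: t = {t:.2f}, p = {p:.3g}")
```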

Conclusions: ChatGPT-4 demonstrated significant improvement in providing accurate and detailed treatment recommendations for urological cancers in line with clinical guidelines and expert opinion. However, it is vital to recognize that AI tools are not without flaws and should be utilized with caution. ChatGPT could supplement, but not replace, personalized advice from healthcare professionals.

Source journal: DIGITAL HEALTH
CiteScore: 2.90
Self-citation rate: 7.70%
Articles published: 302