ChatGPT-4 Knows Its A B C D E but Cannot Cite Its Source.
Diane Ghanem, Alexander R Zhu, Whitney Kagabo, Greg Osgood, Babar Shafiq
JBJS Open Access, vol. 9, no. 3 (published online September 5, 2024; eCollection July 1, 2024). DOI: 10.2106/JBJS.OA.24.00099. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11368215/pdf/
Abstract
Introduction: The artificial intelligence language model Chat Generative Pretrained Transformer (ChatGPT) has shown potential as a reliable and accessible educational resource in orthopaedic surgery. Yet the accuracy of the references supporting the information it provides remains unclear, which raises concerns about the integrity of medical content. This study aims to examine the accuracy of the references provided by ChatGPT-4 concerning the Airway, Breathing, Circulation, Disability, Exposure (ABCDE) approach in trauma surgery.
Methods: Two independent reviewers critically assessed 30 ChatGPT-4-generated references supporting the well-established ABCDE trauma protocol, grading each as 0 (nonexistent), 1 (inaccurate), or 2 (accurate). All discrepancies between the ChatGPT-4 and PubMed references were carefully reviewed and bolded. Cohen's kappa coefficient was used to assess inter-reviewer agreement on the accuracy scores of the ChatGPT-4-generated references. Descriptive statistics were used to summarize the mean reference accuracy scores, and one-way analysis of variance was used to compare the mean scores across the 5 ABCDE categories.
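The statistical analyses described above are standard and can be sketched with common Python libraries. The snippet below is a minimal, hypothetical illustration only: the reviewer grades and the even split of 6 references per ABCDE category are assumptions, not the study's data. It simply shows how Cohen's kappa between two reviewers and a one-way ANOVA across the 5 categories could be computed.

    # Hypothetical sketch of the analyses described in the Methods;
    # the grades below are invented, not the study's data.
    from sklearn.metrics import cohen_kappa_score
    from scipy.stats import f_oneway

    # Accuracy grades from two independent reviewers for 30 references
    # (0 = nonexistent, 1 = inaccurate, 2 = accurate), assumed 6 per ABCDE category.
    reviewer_1 = [2, 1, 2, 0, 1, 2, 2, 2, 1, 1, 0, 2, 1, 2, 2, 1, 1, 0,
                  2, 1, 2, 2, 1, 1, 0, 2, 1, 2, 2, 1]
    reviewer_2 = [2, 1, 2, 0, 1, 2, 2, 1, 1, 1, 0, 2, 1, 2, 2, 1, 1, 0,
                  2, 1, 2, 2, 2, 1, 0, 2, 1, 2, 2, 1]

    # Inter-reviewer agreement on the grades (Cohen's kappa).
    kappa = cohen_kappa_score(reviewer_1, reviewer_2)

    # One-way ANOVA comparing grades across the 5 ABCDE categories (6 each).
    groups = [reviewer_1[i:i + 6] for i in range(0, 30, 6)]
    f_stat, p_value = f_oneway(*groups)

    print(f"Cohen's kappa: {kappa:.2f}; ANOVA p-value: {p_value:.3f}")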
Results: ChatGPT-4 had an average reference accuracy score of 66.7%. Of the 30 references, only 43.3% were accurate and deemed "true," while 56.7% were categorized as "false" (43.3% inaccurate and 13.3% nonexistent). Accuracy was consistent across the 5 trauma protocol categories, with no statistically significant difference (p = 0.437).
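For orientation, the reported category percentages correspond to simple counts out of the 30 references. The check below back-derives those counts; the counts 13, 13, and 4 are assumptions inferred from the percentages, and no attempt is made to reproduce the separately reported 66.7% overall accuracy score, whose exact aggregation is not detailed in the abstract.

    # Back-of-the-envelope check with counts inferred from the reported percentages.
    total_refs = 30
    accurate, inaccurate, nonexistent = 13, 13, 4  # assumed counts

    assert accurate + inaccurate + nonexistent == total_refs
    print(f"accurate:    {accurate / total_refs:.1%}")    # 43.3%
    print(f"inaccurate:  {inaccurate / total_refs:.1%}")  # 43.3%
    print(f"nonexistent: {nonexistent / total_refs:.1%}") # 13.3%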
Discussion: With 57% of references being inaccurate or nonexistent, ChatGPT-4 has fallen short in providing reliable and reproducible references, a concerning finding for the safety of using ChatGPT-4 in professional medical decision-making without thorough verification. Only when used cautiously and with cross-referencing can this language model serve as an adjunct learning tool that enhances comprehensiveness as well as knowledge rehearsal and manipulation.