An analysis of ChatGPT recommendations for the diagnosis and treatment of cervical radiculopathy.

IF 2.9 · JCR Q2 (Clinical Neurology) · CAS Tier 2 (Medicine)
Journal of Neurosurgery: Spine · Pub Date: 2024-06-28 · Print Date: 2024-09-01 · DOI: 10.3171/2024.4.SPINE231148
Timothy Hoang, Lathan Liou, Ashley M Rosenberg, Bashar Zaidat, Akiro H Duey, Nancy Shrestha, Wasil Ahmed, Justin Tang, Jun S Kim, Samuel K Cho
{"title":"分析 ChatGPT 关于颈椎病诊断和治疗的建议。","authors":"Timothy Hoang, Lathan Liou, Ashley M Rosenberg, Bashar Zaidat, Akiro H Duey, Nancy Shrestha, Wasil Ahmed, Justin Tang, Jun S Kim, Samuel K Cho","doi":"10.3171/2024.4.SPINE231148","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>The objective of this study was to assess the safety and accuracy of ChatGPT recommendations in comparison to the evidence-based guidelines from the North American Spine Society (NASS) for the diagnosis and treatment of cervical radiculopathy.</p><p><strong>Methods: </strong>ChatGPT was prompted with questions from the 2011 NASS clinical guidelines for cervical radiculopathy and evaluated for concordance. Selected key phrases within the NASS guidelines were identified. Completeness was measured as the number of overlapping key phrases between ChatGPT responses and NASS guidelines divided by the total number of key phrases. A senior spine surgeon evaluated the ChatGPT responses for safety and accuracy. ChatGPT responses were further evaluated on their readability, similarity, and consistency. Flesch Reading Ease scores and Flesch-Kincaid reading levels were measured to assess readability. The Jaccard Similarity Index was used to assess agreement between ChatGPT responses and NASS clinical guidelines.</p><p><strong>Results: </strong>A total of 100 key phrases were identified across 14 NASS clinical guidelines. The mean completeness of ChatGPT-4 was 46%. ChatGPT-3.5 yielded a completeness of 34%. ChatGPT-4 outperformed ChatGPT-3.5 by a margin of 12%. ChatGPT-4.0 outputs had a mean Flesch reading score of 15.24, which is very difficult to read, requiring a college graduate education to understand. ChatGPT-3.5 outputs had a lower mean Flesch reading score of 8.73, indicating that they are even more difficult to read and require a professional education level to do so. However, both versions of ChatGPT were more accessible than NASS guidelines, which had a mean Flesch reading score of 4.58. Furthermore, with NASS guidelines as a reference, ChatGPT-3.5 registered a mean ± SD Jaccard Similarity Index score of 0.20 ± 0.078 while ChatGPT-4 had a mean of 0.18 ± 0.068. Based on physician evaluation, outputs from ChatGPT-3.5 and ChatGPT-4.0 were safe 100% of the time. Thirteen of 14 (92.8%) ChatGPT-3.5 responses and 14 of 14 (100%) ChatGPT-4.0 responses were in agreement with current best clinical practices for cervical radiculopathy according to a senior spine surgeon.</p><p><strong>Conclusions: </strong>ChatGPT models were able to provide safe and accurate but incomplete responses to NASS clinical guideline questions about cervical radiculopathy. Although the authors' results suggest that improvements are required before ChatGPT can be reliably deployed in a clinical setting, future versions of the LLM hold promise as an updated reference for guidelines on cervical radiculopathy. Future versions must prioritize accessibility and comprehensibility for a diverse audience.</p>","PeriodicalId":16562,"journal":{"name":"Journal of neurosurgery. 
Spine","volume":" ","pages":"385-395"},"PeriodicalIF":2.9000,"publicationDate":"2024-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An analysis of ChatGPT recommendations for the diagnosis and treatment of cervical radiculopathy.\",\"authors\":\"Timothy Hoang, Lathan Liou, Ashley M Rosenberg, Bashar Zaidat, Akiro H Duey, Nancy Shrestha, Wasil Ahmed, Justin Tang, Jun S Kim, Samuel K Cho\",\"doi\":\"10.3171/2024.4.SPINE231148\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong>The objective of this study was to assess the safety and accuracy of ChatGPT recommendations in comparison to the evidence-based guidelines from the North American Spine Society (NASS) for the diagnosis and treatment of cervical radiculopathy.</p><p><strong>Methods: </strong>ChatGPT was prompted with questions from the 2011 NASS clinical guidelines for cervical radiculopathy and evaluated for concordance. Selected key phrases within the NASS guidelines were identified. Completeness was measured as the number of overlapping key phrases between ChatGPT responses and NASS guidelines divided by the total number of key phrases. A senior spine surgeon evaluated the ChatGPT responses for safety and accuracy. ChatGPT responses were further evaluated on their readability, similarity, and consistency. Flesch Reading Ease scores and Flesch-Kincaid reading levels were measured to assess readability. The Jaccard Similarity Index was used to assess agreement between ChatGPT responses and NASS clinical guidelines.</p><p><strong>Results: </strong>A total of 100 key phrases were identified across 14 NASS clinical guidelines. The mean completeness of ChatGPT-4 was 46%. ChatGPT-3.5 yielded a completeness of 34%. ChatGPT-4 outperformed ChatGPT-3.5 by a margin of 12%. ChatGPT-4.0 outputs had a mean Flesch reading score of 15.24, which is very difficult to read, requiring a college graduate education to understand. ChatGPT-3.5 outputs had a lower mean Flesch reading score of 8.73, indicating that they are even more difficult to read and require a professional education level to do so. However, both versions of ChatGPT were more accessible than NASS guidelines, which had a mean Flesch reading score of 4.58. Furthermore, with NASS guidelines as a reference, ChatGPT-3.5 registered a mean ± SD Jaccard Similarity Index score of 0.20 ± 0.078 while ChatGPT-4 had a mean of 0.18 ± 0.068. Based on physician evaluation, outputs from ChatGPT-3.5 and ChatGPT-4.0 were safe 100% of the time. Thirteen of 14 (92.8%) ChatGPT-3.5 responses and 14 of 14 (100%) ChatGPT-4.0 responses were in agreement with current best clinical practices for cervical radiculopathy according to a senior spine surgeon.</p><p><strong>Conclusions: </strong>ChatGPT models were able to provide safe and accurate but incomplete responses to NASS clinical guideline questions about cervical radiculopathy. Although the authors' results suggest that improvements are required before ChatGPT can be reliably deployed in a clinical setting, future versions of the LLM hold promise as an updated reference for guidelines on cervical radiculopathy. Future versions must prioritize accessibility and comprehensibility for a diverse audience.</p>\",\"PeriodicalId\":16562,\"journal\":{\"name\":\"Journal of neurosurgery. 
Spine\",\"volume\":\" \",\"pages\":\"385-395\"},\"PeriodicalIF\":2.9000,\"publicationDate\":\"2024-06-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of neurosurgery. Spine\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.3171/2024.4.SPINE231148\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/9/1 0:00:00\",\"PubModel\":\"Print\",\"JCR\":\"Q2\",\"JCRName\":\"CLINICAL NEUROLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of neurosurgery. Spine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.3171/2024.4.SPINE231148","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/9/1 0:00:00","PubModel":"Print","JCR":"Q2","JCRName":"CLINICAL NEUROLOGY","Score":null,"Total":0}
Citations: 0

Abstract

An analysis of ChatGPT recommendations for the diagnosis and treatment of cervical radiculopathy.

Objective: The objective of this study was to assess the safety and accuracy of ChatGPT recommendations in comparison to the evidence-based guidelines from the North American Spine Society (NASS) for the diagnosis and treatment of cervical radiculopathy.

Methods: ChatGPT was prompted with questions from the 2011 NASS clinical guidelines for cervical radiculopathy and evaluated for concordance. Selected key phrases within the NASS guidelines were identified. Completeness was measured as the number of overlapping key phrases between ChatGPT responses and NASS guidelines divided by the total number of key phrases. A senior spine surgeon evaluated the ChatGPT responses for safety and accuracy. ChatGPT responses were further evaluated on their readability, similarity, and consistency. Flesch Reading Ease scores and Flesch-Kincaid reading levels were measured to assess readability. The Jaccard Similarity Index was used to assess agreement between ChatGPT responses and NASS clinical guidelines.
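
For concreteness, the sketch below shows how the two overlap metrics described above could be computed. This is a minimal illustration, not the authors' actual pipeline: the function names and the example phrase sets are assumptions, and it presumes key phrases were curated as plain strings.

```python
# Minimal sketch (not the authors' code) of the completeness and
# Jaccard Similarity metrics described in the Methods.

def completeness(response_phrases: set[str], guideline_phrases: set[str]) -> float:
    """Share of guideline key phrases that also appear in the response."""
    overlap = response_phrases & guideline_phrases
    return len(overlap) / len(guideline_phrases)

def jaccard_similarity(a: set[str], b: set[str]) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B| over two phrase sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical example: 2 of 4 guideline phrases recovered -> 50% completeness.
guideline = {"imaging", "physical therapy", "nerve root", "epidural injection"}
response = {"imaging", "physical therapy", "rest"}
print(completeness(response, guideline))        # 0.5
print(jaccard_similarity(response, guideline))  # 2 / 5 = 0.4
```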

Results: A total of 100 key phrases were identified across 14 NASS clinical guidelines. The mean completeness of ChatGPT-4 was 46%, while ChatGPT-3.5 yielded a completeness of 34%; ChatGPT-4 thus outperformed ChatGPT-3.5 by 12 percentage points. ChatGPT-4 outputs had a mean Flesch Reading Ease score of 15.24, which is very difficult to read, requiring a college graduate education to understand. ChatGPT-3.5 outputs had a lower mean score of 8.73, indicating that they are even more difficult to read, requiring a professional education level to understand. However, both versions of ChatGPT were more accessible than the NASS guidelines, which had a mean Flesch Reading Ease score of 4.58. Furthermore, with the NASS guidelines as a reference, ChatGPT-3.5 registered a mean ± SD Jaccard Similarity Index score of 0.20 ± 0.078 while ChatGPT-4 had a mean of 0.18 ± 0.068. Based on physician evaluation, outputs from ChatGPT-3.5 and ChatGPT-4 were safe 100% of the time. Thirteen of 14 (92.8%) ChatGPT-3.5 responses and 14 of 14 (100%) ChatGPT-4 responses were in agreement with current best clinical practices for cervical radiculopathy according to a senior spine surgeon.
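
For reference, the Flesch Reading Ease score reported above is a standard function of average sentence length and average syllables per word, with lower scores indicating harder text; all three means reported here (15.24, 8.73, and 4.58) fall in the 0-30 "very difficult" band associated with college-graduate-level reading. A minimal sketch of the two readability formulas, using deliberately naive tokenization and a crude vowel-run syllable counter (both assumptions for illustration, not the study's measurement tooling):

```python
import re

def count_syllables(word: str) -> int:
    # Crude syllable estimate: count runs of vowels (a stand-in for a
    # proper dictionary-based counter; an assumption for illustration).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard formula: lower scores = harder text.
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

def flesch_kincaid_grade(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade-level formula: higher = more schooling.
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```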

Conclusions: ChatGPT models were able to provide safe and accurate but incomplete responses to NASS clinical guideline questions about cervical radiculopathy. Although the authors' results suggest that improvements are required before ChatGPT can be reliably deployed in a clinical setting, future versions of the LLM hold promise as an updated reference for guidelines on cervical radiculopathy. Future versions must prioritize accessibility and comprehensibility for a diverse audience.

Source journal
Journal of Neurosurgery: Spine (Medicine - Clinical Neurology)
CiteScore: 5.10
Self-citation rate: 10.70%
Annual publications: 396
Review time: 6 months
About the journal: The journal primarily publishes original works in neurosurgery but also includes studies in clinical neurophysiology, organic neurology, ophthalmology, radiology, pathology, and molecular biology.