Accuracy of ChatGPT-3.5, ChatGPT-4o, Copilot, Gemini, Claude, and Perplexity in advising on lumbosacral radicular pain against clinical practice guidelines: cross-sectional study

IF 3.2 Q1 HEALTH CARE SCIENCES & SERVICES
Frontiers in Digital Health · Pub Date: 2025-06-27 · eCollection Date: 2025-01-01 · DOI: 10.3389/fdgth.2025.1574287
Giacomo Rossettini, Silvia Bargeri, Chad Cook, Stefania Guida, Alvisa Palese, Lia Rodeghiero, Paolo Pillastrini, Andrea Turolla, Greta Castellini, Silvia Gianola
Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12245906/pdf/
Citations: 0

Abstract


Introduction: Artificial Intelligence (AI) chatbots, which generate human-like responses based on extensive data, are becoming important tools in healthcare, acting as virtual assistants that provide information on health conditions, treatments, and preventive measures. However, how well their answers to complex clinical questions on lumbosacral radicular pain align with clinical practice guidelines (CPGs) is still unclear. We aim to evaluate AI chatbots' performance against CPG recommendations for diagnosing and treating lumbosacral radicular pain.

Methods: We performed a cross-sectional study to assess AI chatbots' responses against CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinical questions based on these CPGs were posed to the latest versions (updated in 2024) of six AI chatbots: ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, Google Gemini, Claude, and Perplexity. The chatbots' responses were evaluated for (a) consistency of text responses using Plagiarism Checker X, (b) intra- and inter-rater reliability using Fleiss' Kappa, and (c) match rate with CPGs. Statistical analyses were performed with STATA/MP 16.1.
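For readers unfamiliar with the reliability statistic named above, the following is a minimal illustrative sketch (not the authors' code) of Fleiss' kappa, which measures agreement among multiple raters assigning items to categories, e.g. raters judging each chatbot response as matching or not matching a CPG recommendation. The category labels in the comments are assumptions for illustration.

```python
from typing import List

def fleiss_kappa(ratings: List[List[int]]) -> float:
    """Fleiss' kappa for multi-rater categorical agreement.

    ratings[i][j] = number of raters who assigned item i to category j
    (e.g. categories "matches CPG" / "does not match").
    Every row must sum to the same number of raters n.
    """
    N = len(ratings)        # number of items rated
    n = sum(ratings[0])     # raters per item
    k = len(ratings[0])     # number of categories

    # Overall proportion of assignments falling in each category
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]

    # Observed agreement for each item
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]

    P_bar = sum(P_i) / N                 # mean observed agreement
    P_e = sum(p * p for p in p_j)        # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)
```

With three raters in full agreement on every item the function returns 1.0 ("almost perfect" on the conventional Landis–Koch scale used in the abstract); values near 0 indicate chance-level agreement.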

Results: We found high variability in the text consistency of AI chatbot responses (median range 26%-68%). Intra-rater reliability ranged from "almost perfect" to "substantial," while inter-rater reliability varied from "almost perfect" to "moderate." Perplexity had the highest match rate at 67%, followed by Google Gemini at 63%, and Microsoft Copilot at 44%. ChatGPT-3.5, ChatGPT-4o, and Claude showed the lowest performance, each with a 33% match rate.
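The match rates reported above are simply the share of a chatbot's answers judged consistent with the CPG recommendation for each clinical question. A hypothetical sketch (the question count below is an assumption, not taken from the paper):

```python
from typing import List

def match_rate(judgements: List[bool]) -> float:
    """Fraction of answers judged consistent with the CPG recommendation.

    judgements: one boolean per clinical question, True if the
    chatbot's answer matched the guideline recommendation.
    """
    return sum(judgements) / len(judgements)

# Hypothetical: 6 CPG-consistent answers out of 9 questions ≈ 67%.
example = [True] * 6 + [False] * 3
rate = match_rate(example)
```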

Conclusions: Despite variable internal consistency and good intra- and inter-rater reliability, the AI chatbots' recommendations often did not align with CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinicians and patients should exercise caution when relying on these AI models, since one- to two-thirds of the recommendations provided may be inappropriate or misleading, depending on the specific chatbot.

Source journal: Frontiers in Digital Health — CiteScore 4.20, self-citation rate 0.00%, typical review time 13 weeks.