Quality of Information Provided by Artificial Intelligence Chatbots Surrounding the Management of Vestibular Schwannomas: A Comparative Analysis Between ChatGPT-4 and Claude 2.

IF 1.9 · Q3 (JCR: Clinical Neurology) · Zone 3, Medicine (CAS)
Otology & Neurotology, 2025-04-01 (Epub 2025-02-04), pp. 432-436. DOI: 10.1097/MAO.0000000000004410
Daniele Borsetto, Egidio Sia, Patrick Axon, Neil Donnelly, James R Tysome, Lukas Anschuetz, Daniele Bernardeschi, Vincenzo Capriotti, Per Caye-Thomasen, Niels Cramer West, Isaac D Erbele, Sebastiano Franchella, Annalisa Gatto, Jeanette Hess-Erga, Henricus P M Kunst, John P Marinelli, Richard Mannion, Benedict Panizza, Franco Trabalzini, Rupert Obholzer, Luigi Angelo Vaira, Jerry Polesel, Fabiola Giudici, Matthew L Carlson, Giancarlo Tirelli, Paolo Boscolo-Rizzo
Citations: 0

Abstract

Objective: To examine the quality of information provided by artificial intelligence platforms ChatGPT-4 and Claude 2 surrounding the management of vestibular schwannomas.

Study design: Cross-sectional.

Setting: Skull base surgeons from multiple centers and countries.

Intervention: Thirty-six questions regarding vestibular schwannoma management were tested. Artificial intelligence responses were subsequently evaluated by 19 lateral skull base surgeons using the Quality Assessment of Medical Artificial Intelligence (QAMAI) questionnaire, assessing "Accuracy," "Clarity," "Relevance," "Completeness," "Sources," and "Usefulness."

Main outcome measure: The scores of the answers from both chatbots were collected and analyzed using the Student t test. Analysis of responses grouped by stakeholders was performed with the McNemar test. The Stuart-Maxwell test was used to compare reading levels between chatbots. The intraclass correlation coefficient was calculated.
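The per-question comparison described above can be sketched as follows. This is a hypothetical illustration, not the authors' analysis code: the scores are invented, and a paired t test is one plausible reading of "Student t test" here, since the same 36 questions were posed to both chatbots.

```python
from scipy import stats

# Hypothetical QAMAI total scores for six illustrative questions,
# one score per question for each chatbot (paired by question).
gpt4_scores = [24.0, 22.5, 26.0, 21.0, 25.5, 23.0]
claude2_scores = [21.5, 22.0, 23.5, 20.0, 24.0, 22.5]

# Paired (dependent-samples) t test: each question is rated for both chatbots,
# so the observations are naturally paired.
t_stat, p_value = stats.ttest_rel(gpt4_scores, claude2_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

A positive t statistic here would indicate higher scores for the first group (ChatGPT-4 in this sketch); the study reports such per-question comparisons across all 36 questions.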

Results: ChatGPT-4 demonstrated significantly higher quality than Claude 2 in 14 of 36 (38.9%) questions, whereas higher-quality scores for Claude 2 were observed in only 2 (5.6%) answers. The chatbots exhibited variation across the dimensions of "Accuracy," "Clarity," "Completeness," "Relevance," and "Usefulness," with ChatGPT-4 demonstrating statistically significantly superior performance. However, no statistically significant difference was found in the assessment of "Sources." Additionally, ChatGPT-4 provided information at a significantly lower reading grade level.
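The "reading grade level" result above is typically obtained with a readability formula; the abstract does not say which one was used. As a sketch, the widely used Flesch-Kincaid grade level can be computed from sentence, word, and syllable counts (the syllable counter below is a crude vowel-group heuristic, adequate only for illustration):

```python
import re

def count_syllables(word):
    # Crude heuristic: count runs of vowels, then drop a trailing silent "e".
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_kincaid_grade(text):
    # FK grade = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

text = "Vestibular schwannomas are benign tumors. They grow slowly."
grade = flesch_kincaid_grade(text)
print(round(grade, 2))
```

A lower grade means the text is readable by a broader audience, which is the sense in which ChatGPT-4's lower reading grade level is reported as an advantage.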

Conclusions: Artificial intelligence platforms failed to consistently provide accurate information surrounding the management of vestibular schwannomas, although ChatGPT-4 achieved significantly higher scores in most analyzed parameters. These findings demonstrate the potential for significant misinformation for patients seeking information through these platforms.

Source journal: Otology & Neurotology (Medicine — Otorhinolaryngology)
CiteScore: 3.80 · Self-citation rate: 14.30% · Articles per year: 509 · Review time: 3-6 weeks
Journal description: Otology & Neurotology publishes original articles relating to both clinical and basic science aspects of otology, neurotology, and cranial base surgery. As the foremost journal in its field, it has become the favored place for publishing the best of new science relating to the human ear and its diseases. The broadly international character of its contributing authors, editorial board, and readership provides the Journal its decidedly global perspective.