Analyzing Large Language Models' Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard.

IF 3.8; Zone 2, Medicine; Q1 Clinical Neurology

Neurospine Pub Date : 2024-06-01 Epub Date: 2024-06-30 DOI:10.14245/ns.2448098.049

Siegmund Philipp Lang, Ezra Tilahun Yoseph, Aneysis D Gonzalez-Suarez, Robert Kim, Parastou Fatemi, Katherine Wagner, Nicolai Maldaner, Martin N Stienen, Corinna Clio Zygourakis

Neurospine, vol. 21, no. 2, pp. 633-641. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11224745/pdf/


Abstract

Objective: In the digital age, patients turn to online sources for lumbar spine fusion information, necessitating a careful study of large language models (LLMs) like chat generative pre-trained transformer (ChatGPT) for patient education.

Methods: Our study aims to assess the response quality of OpenAI's ChatGPT 3.5 and Google's Bard to patient questions on lumbar spine fusion surgery. We identified 10 critical questions from 158 frequently asked ones via Google search, which were then presented to both chatbots. Five blinded spine surgeons rated the responses on a 4-point scale from 'unsatisfactory' to 'excellent.' The clarity and professionalism of the answers were also evaluated using a 5-point Likert scale.

Results: In our evaluation of 10 questions across ChatGPT 3.5 and Bard, 97% of responses were rated as excellent or satisfactory. Specifically, 62% of ChatGPT's responses were rated excellent and 32% required minimal clarification, with only 6% needing moderate or substantial clarification. Of Bard's responses, 66% were rated excellent and 24% required minimal clarification, with 10% requiring more clarification. No significant difference was found in the overall rating distribution between the 2 models. Both struggled with 3 specific questions regarding surgical risks, success rates, and selection of surgical approaches (Q3, Q4, and Q5). Interrater reliability was low for both models (ChatGPT: k = 0.041, p = 0.622; Bard: k = -0.040, p = 0.601). While both scored well on understanding and empathy, Bard received marginally lower ratings in empathy and professionalism.

Conclusion: ChatGPT 3.5 and Bard effectively answered lumbar spine fusion FAQs, but further training and research are needed to solidify LLMs' role in medical education and healthcare communication.
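
The Results report interrater reliability as a kappa-type coefficient (k) without naming the statistic. As an illustration only, the sketch below assumes Fleiss' kappa, a common choice when a fixed number of raters (here, five surgeons) each assign one category per item. The function name `fleiss_kappa` and the ratings matrix are hypothetical and are not taken from the study; the numbers are made up purely to show the computation.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each a list with one label per rater.

    Assumption: every item is rated by the same number of raters (>= 2).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # counts[i][j] = number of raters who put item i into category j
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Observed agreement: mean over items of the share of agreeing rater pairs
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions
    total = n_items * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # HYPOTHETICAL ratings: 10 questions x 5 raters on a 4-point scale
    # (1 = unsatisfactory ... 4 = excellent). Not data from the study.
    demo = [
        [4, 4, 3, 4, 4], [4, 3, 4, 4, 3], [3, 4, 2, 4, 3], [4, 3, 3, 2, 4],
        [3, 4, 4, 3, 1], [4, 4, 4, 3, 4], [4, 3, 4, 4, 4], [3, 4, 4, 4, 3],
        [4, 4, 3, 4, 4], [4, 4, 4, 4, 3],
    ]
    print(f"Fleiss' kappa (hypothetical data): {fleiss_kappa(demo, [1, 2, 3, 4]):.3f}")
```

A value near 0, like those reported for both models, means the raters agreed about as often as chance alone would predict. This can happen even when most ratings are high, because heavy use of a single category inflates the chance-agreement term.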


Source journal: Neurospine Multiple-

CiteScore: 5.80

Self-citation rate: 18.80%

Review time: 10 weeks