Is the information provided by large language models valid in educating patients about adolescent idiopathic scoliosis? An evaluation of content, clarity, and empathy : The perspective of the European Spine Study Group.
Siegmund Lang, Jacopo Vitale, Fabio Galbusera, Tamás Fekete, Louis Boissiere, Yann Philippe Charles, Altug Yucekul, Caglar Yilgor, Susana Núñez-Pereira, Sleiman Haddad, Alejandro Gomez-Rice, Jwalant Mehta, Javier Pizones, Ferran Pellisé, Ibrahim Obeid, Ahmet Alanay, Frank Kleinstück, Markus Loibl
{"title":"Is the information provided by large language models valid in educating patients about adolescent idiopathic scoliosis? An evaluation of content, clarity, and empathy : The perspective of the European Spine Study Group.","authors":"Siegmund Lang, Jacopo Vitale, Fabio Galbusera, Tamás Fekete, Louis Boissiere, Yann Philippe Charles, Altug Yucekul, Caglar Yilgor, Susana Núñez-Pereira, Sleiman Haddad, Alejandro Gomez-Rice, Jwalant Mehta, Javier Pizones, Ferran Pellisé, Ibrahim Obeid, Ahmet Alanay, Frank Kleinstück, Markus Loibl","doi":"10.1007/s43390-024-00955-3","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Large language models (LLM) have the potential to bridge knowledge gaps in patient education and enrich patient-surgeon interactions. This study evaluated three chatbots for delivering empathetic and precise adolescent idiopathic scoliosis (AIS) related information and management advice. Specifically, we assessed the accuracy, clarity, and relevance of the information provided, aiming to determine the effectiveness of LLMs in addressing common patient queries and enhancing their understanding of AIS.</p><p><strong>Methods: </strong>We sourced 20 webpages for the top frequently asked questions (FAQs) about AIS and formulated 10 critical questions based on them. Three advanced LLMs-ChatGPT 3.5, ChatGPT 4.0, and Google Bard-were selected to answer these questions, with responses limited to 200 words. The LLMs' responses were evaluated by a blinded group of experienced deformity surgeons (members of the European Spine Study Group) from seven European spine centers. A pre-established 4-level rating system from excellent to unsatisfactory was used with a further rating for clarity, comprehensiveness, and empathy on the 5-point Likert scale. If not rated 'excellent', the raters were asked to report the reasons for their decision for each question. 
Lastly, raters were asked for their opinion towards AI in healthcare in general in six questions.</p><p><strong>Results: </strong>The responses among all LLMs were 'excellent' in 26% of responses, with ChatGPT-4.0 leading (39%), followed by Bard (17%). ChatGPT-4.0 was rated superior to Bard and ChatGPT 3.5 (p = 0.003). Discrepancies among raters were significant (p < 0.0001), questioning inter-rater reliability. No substantial differences were noted in answer distribution by question (p = 0.43). The answers on diagnosis (Q2) and causes (Q4) of AIS were top-rated. The most dissatisfaction was seen in the answers regarding definitions (Q1) and long-term results (Q7). Exhaustiveness, clarity, empathy, and length of the answers were positively rated (> 3.0 on 5.0) and did not demonstrate any differences among LLMs. However, GPT-3.5 struggled with language suitability and empathy, while Bard's responses were overly detailed and less empathetic. Overall, raters found that 9% of answers were off-topic and 22% contained clear mistakes.</p><p><strong>Conclusion: </strong>Our study offers crucial insights into the strengths and weaknesses of current LLMs in AIS patient and parent education, highlighting the promise of advancements like ChatGPT-4.o and Gemini alongside the need for continuous improvement in empathy, contextual understanding, and language appropriateness.</p>","PeriodicalId":21796,"journal":{"name":"Spine deformity","volume":null,"pages":null},"PeriodicalIF":1.6000,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Spine deformity","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s43390-024-00955-3","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"CLINICAL 
NEUROLOGY","Score":null,"Total":0}
Citations: 0
Abstract
Purpose: Large language models (LLMs) have the potential to bridge knowledge gaps in patient education and enrich patient-surgeon interactions. This study evaluated three chatbots for delivering empathetic and precise information and management advice related to adolescent idiopathic scoliosis (AIS). Specifically, we assessed the accuracy, clarity, and relevance of the information provided, aiming to determine the effectiveness of LLMs in addressing common patient queries and enhancing patients' understanding of AIS.
Methods: We reviewed 20 webpages listing the most frequently asked questions (FAQs) about AIS and formulated 10 key questions based on them. Three advanced LLMs (ChatGPT 3.5, ChatGPT 4.0, and Google Bard) were selected to answer these questions, with responses limited to 200 words. The LLMs' responses were evaluated by a blinded group of experienced deformity surgeons (members of the European Spine Study Group) from seven European spine centers. A pre-established 4-level rating system, from excellent to unsatisfactory, was used, with additional ratings for clarity, comprehensiveness, and empathy on a 5-point Likert scale. For any response not rated 'excellent', the raters were asked to report the reasons for their decision. Lastly, raters were asked six questions about their general opinion of AI in healthcare.
Results: Across all LLMs, 26% of responses were rated 'excellent', with ChatGPT-4.0 leading (39%), followed by Bard (17%). ChatGPT-4.0 was rated superior to Bard and ChatGPT 3.5 (p = 0.003). Discrepancies among raters were significant (p < 0.0001), calling inter-rater reliability into question. No substantial differences were noted in answer distribution by question (p = 0.43). The answers on diagnosis (Q2) and causes (Q4) of AIS were top-rated. The greatest dissatisfaction was seen in the answers regarding definitions (Q1) and long-term results (Q7). Exhaustiveness, clarity, empathy, and length of the answers were rated positively (> 3.0 out of 5.0) and did not differ among LLMs. However, GPT-3.5 struggled with language suitability and empathy, while Bard's responses were overly detailed and less empathetic. Overall, raters found that 9% of answers were off-topic and 22% contained clear mistakes.
Conclusion: Our study offers crucial insights into the strengths and weaknesses of current LLMs in AIS patient and parent education, highlighting the promise of advancements like ChatGPT-4.0 and Gemini alongside the need for continuous improvement in empathy, contextual understanding, and language appropriateness.
About the journal:
Spine Deformity, the official journal of the Scoliosis Research Society, is a peer-refereed publication that disseminates knowledge on basic science and clinical research into the etiology, biomechanics, treatment methods, and outcomes of all types of spinal deformities. The international members of the Editorial Board provide a worldwide perspective for the journal's area of interest. The journal enhances the mission of the Society, which is to foster the optimal care of all patients with spine deformities worldwide. Articles published in Spine Deformity are Medline indexed in PubMed. The journal publishes original articles in the form of clinical and basic research. Spine Deformity will only publish studies that have institutional review board (IRB) or similar ethics committee approval for human and animal studies and have strictly observed these guidelines. The minimum follow-up period for clinical follow-up studies is 24 months.