Can large language models address unmet patient information needs and reduce provider burnout in the management of thyroid disease?

Impact factor 3.2 · CAS Zone 2 (Medicine) · JCR Q1 (Surgery)

Surgery · Pub Date: 2024-10-17 · DOI: 10.1016/j.surg.2024.06.075

Rajam Raghunathan, Anna R Jacobs, Vivek R Sant, Lizabeth J King, Gary Rothberger, Jason Prescott, John Allendorf, Carolyn D Seib, Kepal N Patel, Insoo Suh

Citations: 0

Abstract

Background: Patient electronic messaging has increased clinician workload, contributing to burnout. Large language models can respond to these patient queries, but no studies exist on large language model responses in thyroid disease.

Methods: This cross-sectional study randomly selected 33 of 52 patient questions found on Reddit/askdocs. Questions were found through a "thyroid + cancer" or "thyroid + disease" search and had verified-physician responses. Additional responses were generated using ChatGPT-3.5 and GPT-4. Questions and responses were anonymized and graded for accuracy, quality, and empathy using a 4-point Likert scale by blinded providers, including 4 surgeons, 1 endocrinologist, and 2 physician assistants (n = 7). Results were analyzed using a single-factor analysis of variance.
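
As an illustration of the single-factor analysis of variance named in the Methods, the following is a minimal sketch in plain Python. It is not the authors' analysis code: the function name `one_way_anova` and the three toy groups are hypothetical, and the numbers are made up purely to show the arithmetic, not taken from the study.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a single-factor ANOVA.

    `groups` is a list of lists of numeric scores, one list per group.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of each score around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy data (NOT study data): three groups of three scores each.
toy_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df_between, df_within = one_way_anova(toy_groups)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

The sketch stops at the F statistic and its degrees of freedom; turning these into a P value requires the F distribution, which a statistics library would normally supply.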

Results: For accuracy, the results averaged 2.71/4 (standard deviation 1.04), 3.49/4 (0.391), and 3.66/4 (0.286) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = completely true information, 3 = greater than 50% true information, and 2 = less than 50% true information. For quality, the results were 2.37/4 (standard deviation 0.661), 2.98/4 (0.352), and 3.81/4 (0.36) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = provided information beyond what was asked, 3 = completely answers the question, and 2 = partially answers the question. For empathy, the mean scores were 2.37/4 (standard deviation 0.661), 2.80/4 (0.582), and 3.14/4 (0.578) for physicians, GPT-3.5, and GPT-4, respectively (P < .01), where 4 = anticipates and infers patient feelings from the expressed question, 3 = mirrors the patient's feelings, and 2 = contains no dismissive comments. Responses by GPT were ranked first 95% of the time.

Conclusions: Large language model responses to patient queries about thyroid disease have the potential to be more accurate, complete, empathetic, and consistent than physician responses.

Source journal

Surgery (Medicine: Surgery)

CiteScore: 5.40

Self-citation rate: 5.30%

Articles published: 687

Review time: 64 days

Journal description: For 66 years, Surgery has published practical, authoritative information about procedures, clinical advances, and major trends shaping general surgery. Each issue features original scientific contributions and clinical reports. Peer-reviewed articles cover topics in oncology, trauma, gastrointestinal, vascular, and transplantation surgery. The journal also publishes papers from the meetings of its sponsoring societies, the Society of University Surgeons, the Central Surgical Association, and the American Association of Endocrine Surgeons.