Niklaus P. Zeller BS, Ayush D. Shah BA, Ann E. Van Heest MD, Deborah C. Bohn MD
{"title":"评估聊天生成预训练变压器对有关先天性上肢差异的常见患者问题的反应的准确性","authors":"Niklaus P. Zeller BS , Ayush D. Shah BA , Ann E. Van Heest MD , Deborah C. Bohn MD","doi":"10.1016/j.jhsg.2025.100764","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><div>The purpose was to assess the ability of Chat Generative Pre-Trained Transformer (ChatGPT) 4.0 to accurately and reliably answer patients’ frequently asked questions (FAQs) about congenital upper limb differences (CULDs) and their treatment options.</div></div><div><h3>Methods</h3><div>Two pediatric hand surgeons were queried regarding FAQs they receive from parents about CULDs. Sixteen FAQs were input to ChatGPT-4.0 for the following conditions: (1) syndactyly, (2) polydactyly, (3) radial longitudinal deficiency, (4) thumb hypoplasia, and (5) general congenital hand differences. Two additional psychosocial care questions were queried, and all responses were graded by the surgeons using a scale of 1–4, based on the quality of the response. Independent chats were used for each question to reduce memory–retention bias with no pretraining of the software application.</div></div><div><h3>Results</h3><div>Overall, ChatGPT provided relatively reliable, evidence-based responses to the 16 queried FAQs. In total, 164 grades were assigned to the 82 ChatGPT responses: 83 (51%) did not require any clarification, 37 (23%) required minimal clarification, 32 (20%) required moderate clarification, and 13 (8%) received an unsatisfactory rating. However, there was considerable variability in the depth of many responses. When queried on medical associations with syndactyly and polydactyly, ChatGPT provided a detailed account of associated syndromes, although there was no mention that syndromic involvement is relatively rare. Furthermore, ChatGPT recommended that the patients consult a health care provider for individualized care 81 times in 49 responses. It commonly “referred” patients to genetic counselors (n = 26, 32%), followed by pediatric orthopedic surgeons and orthopedic surgeons (n = 16, 20%), and hand surgeons (n = 9, 11%).</div></div><div><h3>Conclusions</h3><div>Chat Generative Pre-Trained Transformer provided evidence-based responses not requiring clarification to a majority of FAQs about CULDs. However, there was considerable variation across the responses, and it rarely “referred” patients to hand surgeons. As new tools for patient education, ChatGPT and similar large language models should be approached cautiously when seeking information about CULDs. Responses do not consistently provide comprehensive, individualized information. 8% of responses were misguiding.</div></div><div><h3>Type of study/level of evidence</h3><div>Economic/decision analysis IIC.</div></div>","PeriodicalId":36920,"journal":{"name":"Journal of Hand Surgery Global Online","volume":"7 4","pages":"Article 100764"},"PeriodicalIF":0.0000,"publicationDate":"2025-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Assessing Accuracy of Chat Generative Pre-Trained Transformer’s Responses to Common Patient Questions Regarding Congenital Upper Limb Differences\",\"authors\":\"Niklaus P. Zeller BS , Ayush D. Shah BA , Ann E. Van Heest MD , Deborah C. 
Bohn MD\",\"doi\":\"10.1016/j.jhsg.2025.100764\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Purpose</h3><div>The purpose was to assess the ability of Chat Generative Pre-Trained Transformer (ChatGPT) 4.0 to accurately and reliably answer patients’ frequently asked questions (FAQs) about congenital upper limb differences (CULDs) and their treatment options.</div></div><div><h3>Methods</h3><div>Two pediatric hand surgeons were queried regarding FAQs they receive from parents about CULDs. Sixteen FAQs were input to ChatGPT-4.0 for the following conditions: (1) syndactyly, (2) polydactyly, (3) radial longitudinal deficiency, (4) thumb hypoplasia, and (5) general congenital hand differences. Two additional psychosocial care questions were queried, and all responses were graded by the surgeons using a scale of 1–4, based on the quality of the response. Independent chats were used for each question to reduce memory–retention bias with no pretraining of the software application.</div></div><div><h3>Results</h3><div>Overall, ChatGPT provided relatively reliable, evidence-based responses to the 16 queried FAQs. In total, 164 grades were assigned to the 82 ChatGPT responses: 83 (51%) did not require any clarification, 37 (23%) required minimal clarification, 32 (20%) required moderate clarification, and 13 (8%) received an unsatisfactory rating. However, there was considerable variability in the depth of many responses. When queried on medical associations with syndactyly and polydactyly, ChatGPT provided a detailed account of associated syndromes, although there was no mention that syndromic involvement is relatively rare. Furthermore, ChatGPT recommended that the patients consult a health care provider for individualized care 81 times in 49 responses. It commonly “referred” patients to genetic counselors (n = 26, 32%), followed by pediatric orthopedic surgeons and orthopedic surgeons (n = 16, 20%), and hand surgeons (n = 9, 11%).</div></div><div><h3>Conclusions</h3><div>Chat Generative Pre-Trained Transformer provided evidence-based responses not requiring clarification to a majority of FAQs about CULDs. However, there was considerable variation across the responses, and it rarely “referred” patients to hand surgeons. As new tools for patient education, ChatGPT and similar large language models should be approached cautiously when seeking information about CULDs. Responses do not consistently provide comprehensive, individualized information. 
8% of responses were misguiding.</div></div><div><h3>Type of study/level of evidence</h3><div>Economic/decision analysis IIC.</div></div>\",\"PeriodicalId\":36920,\"journal\":{\"name\":\"Journal of Hand Surgery Global Online\",\"volume\":\"7 4\",\"pages\":\"Article 100764\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-05-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Hand Surgery Global Online\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2589514125000842\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"Medicine\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Hand Surgery Global Online","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2589514125000842","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Medicine","Score":null,"Total":0}
Assessing Accuracy of Chat Generative Pre-Trained Transformer’s Responses to Common Patient Questions Regarding Congenital Upper Limb Differences
Purpose
The purpose was to assess the ability of Chat Generative Pre-Trained Transformer (ChatGPT) 4.0 to accurately and reliably answer patients’ frequently asked questions (FAQs) about congenital upper limb differences (CULDs) and their treatment options.
Methods
Two pediatric hand surgeons were surveyed regarding the FAQs they receive from parents about CULDs. Sixteen FAQs were input to ChatGPT-4.0 for each of the following conditions: (1) syndactyly, (2) polydactyly, (3) radial longitudinal deficiency, (4) thumb hypoplasia, and (5) general congenital hand differences. Two additional questions about psychosocial care were also posed. Both surgeons graded every response on a scale of 1–4 based on its quality. Each question was posed in an independent chat, with no pretraining of the model, to reduce memory-retention bias.
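The independent-chat design can be pictured as a small querying protocol: one fresh conversation per question, with each response later scored on the 1–4 scale. Below is a minimal sketch, assuming the OpenAI Python SDK as a stand-in for the ChatGPT interface the authors used; the model identifier, example questions, and the ordering of the grade scale (1 assumed best) are illustrative assumptions, not details from the study.

```python
# Illustrative sketch only: the study queried ChatGPT-4.0 through its chat
# interface; the OpenAI SDK usage, model name, and helper below are
# assumptions made for illustration, not the authors' actual tooling.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The study's 1-4 quality scale, per the Methods (assumed ordering: 1 = best).
GRADE_LABELS = {
    1: "no clarification required",
    2: "minimal clarification required",
    3: "moderate clarification required",
    4: "unsatisfactory",
}

def ask_in_independent_chat(question: str) -> str:
    """Pose one FAQ in a brand-new chat: a fresh message list per call means
    no shared history or pretraining, mirroring the study's approach to
    reducing memory-retention bias."""
    response = client.chat.completions.create(
        model="gpt-4",  # stand-in identifier for ChatGPT-4.0
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Hypothetical example questions; the study's FAQs are not listed in the abstract.
faqs = [
    "What causes syndactyly?",
    "Will my child need surgery for polydactyly?",
]
answers = {q: ask_in_independent_chat(q) for q in faqs}
```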
Results
Overall, ChatGPT provided relatively reliable, evidence-based responses to the queried FAQs. In total, 164 grades were assigned to the 82 ChatGPT responses: 83 (51%) did not require any clarification, 37 (23%) required minimal clarification, 32 (20%) required moderate clarification, and 13 (8%) received an unsatisfactory rating. However, there was considerable variability in the depth of many responses. When queried on medical associations with syndactyly and polydactyly, ChatGPT provided a detailed account of associated syndromes but did not mention that syndromic involvement is relatively rare. Furthermore, ChatGPT recommended that patients consult a health care provider for individualized care 81 times across 49 responses. It most commonly “referred” patients to genetic counselors (n = 26, 32%), followed by pediatric orthopedic surgeons and orthopedic surgeons (n = 16, 20%), and hand surgeons (n = 9, 11%).
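As a quick arithmetic check, the headline percentages follow from the grading setup (82 responses, each graded by two surgeons, giving 164 grades); a short sketch reproducing them from the counts reported above:

```python
# Reproduce the reported percentages from the grade counts in the abstract.
total_grades = 82 * 2  # 82 responses, each graded by two surgeons -> 164
counts = {
    "no clarification required": 83,
    "minimal clarification": 37,
    "moderate clarification": 32,
    "unsatisfactory": 13,
}
for label, n in counts.items():
    print(f"{label}: {n}/{total_grades} = {n / total_grades:.0%}")
# prints 51%, 23%, 20%, and 8%, matching the reported figures
```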
Conclusions
ChatGPT provided evidence-based responses requiring no clarification to a majority of FAQs about CULDs. However, there was considerable variation across responses, and it rarely “referred” patients to hand surgeons. ChatGPT and similar large language models are new tools for patient education but should be approached cautiously by those seeking information about CULDs: responses do not consistently provide comprehensive, individualized information, and 8% were misleading.
Type of study/level of evidence
Economic/decision analysis IIC.