Benjamin Fiedler, Umar Ghilzai, Abdullah Ghali, Phillip Goldman, Pablo Coello, Michael B Gottschalk, Eric R Wagner, Adil Shahzad Ahmed
{"title":"作为补充,而非替代:ChatGPT对常见肘部病理反应的准确性和完整性。","authors":"Benjamin Fiedler, Umar Ghilzai, Abdullah Ghali, Phillip Goldman, Pablo Coello, Michael B Gottschalk, Eric R Wagner, Adil Shahzad Ahmed","doi":"10.1177/17585732251365178","DOIUrl":null,"url":null,"abstract":"<p><strong>Hypothesis: </strong>Large language models (LLMs) like ChatGPT have increasingly been used as online resources for patients with orthopedic conditions. Yet there is a paucity of information assessing the ability of LLMs to accurately and completely answer patient questions. The present study comparatively assessed both ChatGPT 3.5 and GPT-4 responses to frequently asked questions on common elbow pathologies, scoring for accuracy and completeness. It was hypothesized that ChatGPT 3.5 and GPT-4 would demonstrate high levels of accuracy for the specific query asked, but some responses would lack completeness, and GPT-4 would yield more accurate and complete responses than ChatGPT 3.5.</p><p><strong>Methods: </strong>ChatGPT was queried to identify five most common elbow pathologies (lateral epicondylitis, medial epicondylitis, cubital tunnel syndrome, distal biceps rupture, elbow arthritis). ChatGPT was then queried on the five most frequently asked questions for each elbow pathology. These 25 total questions were then individually asked of ChatGPT 3.5 and GPT-4. Responses were recorded and scored on 6-point Likert scale for accuracy and 3-point Likert scale for completeness by three fellowship-trained upper extremity orthopedic surgeons. ChatGPT 3.5 and GPT-4 responses were compared for each pathology using two-tailed <i>t</i>-tests.</p><p><strong>Results: </strong>Average accuracy scores for ChatGPT 3.5 ranged from 4.80 to 4.87. Average GPT-4 accuracy scores ranged from 4.80 to 5.13. Average completeness scores for ChatGPT 3.5 ranged from 2.13 to 2.47, and average completeness scores for GPT-4 ranged from 2.47 to 2.80. Total average accuracy for ChatGPT 3.5 was 4.83, and total average accuracy for GPT-4 was 5.0 (<i>p</i> = 0.05). Total average completeness for ChatGPT 3.5 was 2.35, and total average completeness for GPT-4 was 2.66 (<i>p</i> = 0.01).</p><p><strong>Conclusion: </strong>ChatGPT 3.5 and GPT-4 are capable of providing accurate and complete responses to frequently asked patient questions, with GPT-4 providing superior responses. Large language models like ChatGPT have potential to serve as a reliable online resource for patients with elbow conditions.</p>","PeriodicalId":36705,"journal":{"name":"Shoulder and Elbow","volume":" ","pages":"17585732251365178"},"PeriodicalIF":1.1000,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12500603/pdf/","citationCount":"0","resultStr":"{\"title\":\"A supplement, not a substitute: Accuracy and completeness of ChatGPT responses for common elbow pathology.\",\"authors\":\"Benjamin Fiedler, Umar Ghilzai, Abdullah Ghali, Phillip Goldman, Pablo Coello, Michael B Gottschalk, Eric R Wagner, Adil Shahzad Ahmed\",\"doi\":\"10.1177/17585732251365178\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Hypothesis: </strong>Large language models (LLMs) like ChatGPT have increasingly been used as online resources for patients with orthopedic conditions. Yet there is a paucity of information assessing the ability of LLMs to accurately and completely answer patient questions. 
The present study comparatively assessed both ChatGPT 3.5 and GPT-4 responses to frequently asked questions on common elbow pathologies, scoring for accuracy and completeness. It was hypothesized that ChatGPT 3.5 and GPT-4 would demonstrate high levels of accuracy for the specific query asked, but some responses would lack completeness, and GPT-4 would yield more accurate and complete responses than ChatGPT 3.5.</p><p><strong>Methods: </strong>ChatGPT was queried to identify five most common elbow pathologies (lateral epicondylitis, medial epicondylitis, cubital tunnel syndrome, distal biceps rupture, elbow arthritis). ChatGPT was then queried on the five most frequently asked questions for each elbow pathology. These 25 total questions were then individually asked of ChatGPT 3.5 and GPT-4. Responses were recorded and scored on 6-point Likert scale for accuracy and 3-point Likert scale for completeness by three fellowship-trained upper extremity orthopedic surgeons. ChatGPT 3.5 and GPT-4 responses were compared for each pathology using two-tailed <i>t</i>-tests.</p><p><strong>Results: </strong>Average accuracy scores for ChatGPT 3.5 ranged from 4.80 to 4.87. Average GPT-4 accuracy scores ranged from 4.80 to 5.13. Average completeness scores for ChatGPT 3.5 ranged from 2.13 to 2.47, and average completeness scores for GPT-4 ranged from 2.47 to 2.80. Total average accuracy for ChatGPT 3.5 was 4.83, and total average accuracy for GPT-4 was 5.0 (<i>p</i> = 0.05). Total average completeness for ChatGPT 3.5 was 2.35, and total average completeness for GPT-4 was 2.66 (<i>p</i> = 0.01).</p><p><strong>Conclusion: </strong>ChatGPT 3.5 and GPT-4 are capable of providing accurate and complete responses to frequently asked patient questions, with GPT-4 providing superior responses. Large language models like ChatGPT have potential to serve as a reliable online resource for patients with elbow conditions.</p>\",\"PeriodicalId\":36705,\"journal\":{\"name\":\"Shoulder and Elbow\",\"volume\":\" \",\"pages\":\"17585732251365178\"},\"PeriodicalIF\":1.1000,\"publicationDate\":\"2025-10-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12500603/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Shoulder and Elbow\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1177/17585732251365178\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"ORTHOPEDICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Shoulder and Elbow","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1177/17585732251365178","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
A supplement, not a substitute: Accuracy and completeness of ChatGPT responses for common elbow pathology.
Hypothesis: Large language models (LLMs) like ChatGPT have increasingly been used as online resources for patients with orthopedic conditions. Yet there is a paucity of information assessing the ability of LLMs to accurately and completely answer patient questions. The present study comparatively assessed ChatGPT 3.5 and GPT-4 responses to frequently asked questions on common elbow pathologies, scoring for accuracy and completeness. It was hypothesized that ChatGPT 3.5 and GPT-4 would demonstrate high accuracy for the specific queries asked, but that some responses would lack completeness, and that GPT-4 would yield more accurate and complete responses than ChatGPT 3.5.
Methods: ChatGPT was queried to identify the five most common elbow pathologies (lateral epicondylitis, medial epicondylitis, cubital tunnel syndrome, distal biceps rupture, elbow arthritis). ChatGPT was then queried for the five most frequently asked questions for each elbow pathology. These 25 questions were then individually posed to ChatGPT 3.5 and GPT-4. Responses were recorded and scored on a 6-point Likert scale for accuracy and a 3-point Likert scale for completeness by three fellowship-trained upper extremity orthopedic surgeons. ChatGPT 3.5 and GPT-4 responses were compared for each pathology using two-tailed t-tests.
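The statistical comparison described above can be illustrated with a minimal sketch. The abstract does not report the raw rater scores or state whether paired or independent-samples t-tests were used, so the values below are hypothetical placeholders and an independent-samples two-tailed test is shown as one plausible choice, not the authors' exact procedure.

```python
# Minimal sketch of the model-vs-model comparison described in Methods.
# All scores are hypothetical placeholders; the study's raw data are not
# available in the abstract, and the choice of an independent-samples
# t-test is an assumption.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical 6-point accuracy scores: 25 questions x 3 raters, flattened.
accuracy_gpt35 = rng.integers(4, 6, size=75)  # placeholder values only
accuracy_gpt4 = rng.integers(4, 7, size=75)   # placeholder values only

# Two-tailed independent-samples t-test comparing the two models.
t_stat, p_value = stats.ttest_ind(accuracy_gpt35, accuracy_gpt4)
print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.3f}")
```

The same comparison would be repeated for the completeness scores and, as described in the abstract, for each pathology separately.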
Results: Average accuracy scores for ChatGPT 3.5 ranged from 4.80 to 4.87. Average GPT-4 accuracy scores ranged from 4.80 to 5.13. Average completeness scores for ChatGPT 3.5 ranged from 2.13 to 2.47, and average completeness scores for GPT-4 ranged from 2.47 to 2.80. Total average accuracy for ChatGPT 3.5 was 4.83, and total average accuracy for GPT-4 was 5.0 (p = 0.05). Total average completeness for ChatGPT 3.5 was 2.35, and total average completeness for GPT-4 was 2.66 (p = 0.01).
Conclusion: ChatGPT 3.5 and GPT-4 are capable of providing accurate and complete responses to frequently asked patient questions, with GPT-4 providing superior responses. Large language models like ChatGPT have potential to serve as a reliable online resource for patients with elbow conditions.