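The Artificial Intelligence Shoulder Arthroplasty Score: development and validation of a tool for large language model responses to common patient questions regarding total shoulder arthroplasty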

JCR quartile: Q4 (Medicine)

Seminars in Arthroplasty. Pub Date: 2025-03-06. DOI: 10.1053/j.sart.2025.02.003

Benjamin Fiedler MD, Jeffrey Hauck BS, T. Bradley Edwards MD, Hussein A. Elkousy MD, Paul J. Cagle MD, Todd Phillips MD

Seminars in Arthroplasty, Volume 35, Issue 3, Pages 348-353. Full text: https://www.sciencedirect.com/science/article/pii/S1045452725000288

Citations: 0



The Artificial Intelligence Shoulder Arthroplasty Score: development and validation of a tool for large language model responses to common patient questions regarding total shoulder arthroplasty

Background and Hypothesis

While research into the ability of artificial intelligence, specifically large language models (LLMs), to respond to patient questions regarding specific orthopedic pathologies continues to grow, no tool presently exists to systematically and comprehensively evaluate the quality of LLM responses. The present study seeks to develop and validate the Artificial Intelligence Shoulder Arthroplasty Score (AISAS) as a comprehensive, standardized, and reproducible system for evaluating artificial intelligence responses to patient questions regarding their orthopedic pathology.

Methods

The novel scoring tool, AISAS, was developed with four equally weighted components: accuracy, completeness, clarity, and readability. Fifteen common patient questions on glenohumeral arthritis were posed one at a time to three of the most widely used LLMs: ChatGPT (version 3.5), Claude 3.5 Sonnet, and Gemini. Ten fellowship-trained shoulder and elbow orthopedic surgeons used the proposed framework to evaluate each of the 45 responses. Inter-rater reliability was calculated via Cohen's kappa, and rater-score correlation was calculated via Cronbach's alpha.
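To make the two statistics named above concrete, here is a minimal Python sketch of their textbook definitions. It is not the authors' analysis code: the abstract does not say how kappa was aggregated across the ten raters, so `cohen_kappa` below covers only the standard two-rater case, and `cronbach_alpha` assumes each rater is treated as an "item". The function names and the toy inputs are hypothetical.

```python
from collections import Counter
from statistics import variance


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items (textbook form)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given the same score by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def cronbach_alpha(scores):
    """Cronbach's alpha; `scores` has one row per response, one column per rater."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two raters and two responses")
    # Variance of each rater's scores across responses.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Variance of the per-response totals summed over raters.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total score variance is zero; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

With ten raters, one common option is to average kappa over all rater pairs, and another is to use a multi-rater statistic such as Fleiss' kappa; which approach the study used is not stated in the abstract.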

Results

AISAS use for Claude and ChatGPT produced moderate agreement (k = 0.55 and 0.43, respectively), while Gemini produced substantial reliability among raters (k = 0.66). Cronbach's alpha scores demonstrated excellent correlation of the Gemini ratings (α = 0.91) and acceptable correlation of the Claude and ChatGPT ratings (α = 0.79 and 0.75, respectively).
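The qualitative labels in these results ("moderate", "substantial", "excellent", "acceptable") match widely used rule-of-thumb bands. The sketch below assumes the Landis and Koch bands for kappa and the common George and Mallery style cutoffs for alpha; the abstract does not state which conventions the authors applied, so the thresholds and function names here are assumptions for illustration only.

```python
def interpret_kappa(kappa):
    """Label a kappa value using Landis and Koch style bands (assumed convention)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


def interpret_alpha(alpha):
    """Label Cronbach's alpha using common rule-of-thumb cutoffs (assumed convention)."""
    if alpha >= 0.9:
        return "excellent"
    if alpha >= 0.8:
        return "good"
    if alpha >= 0.7:
        return "acceptable"
    if alpha >= 0.6:
        return "questionable"
    if alpha >= 0.5:
        return "poor"
    return "unacceptable"


# Values reported in the abstract, keyed by model.
reported = {"Claude": (0.55, 0.79), "ChatGPT": (0.43, 0.75), "Gemini": (0.66, 0.91)}
for model, (kappa, alpha) in reported.items():
    print(model, interpret_kappa(kappa), interpret_alpha(alpha))
```

Under these assumed bands, the labels produced for the reported values agree with the wording used in the abstract.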

Discussion and Conclusion

AISAS enables systematic assessment of both the overall quality of an LLM response and its individual components, which may vary in quality, making comparisons between LLM responses straightforward. It also offers a way to track the progress of LLMs in responding to patient questions over time. Such a framework can help optimize LLMs as a patient tool, identify areas for improvement, and allow physicians to better direct patients on how to use these tools.

Conclusion

The AISAS is a comprehensive and reproducible tool for evaluating LLM responses, with high levels of inter-rater reliability. AISAS use can help to evaluate responses to patient questions to guide growth and improvement of LLMs for use in the orthopedic setting.


Source journal

Seminars in Arthroplasty (Medicine: Surgery)

CiteScore: 1.00

Self-citation rate: 0.00%

Articles published: 104

Journal description: Each issue of Seminars in Arthroplasty provides a comprehensive, current overview of a single topic in arthroplasty. The journal addresses orthopedic surgeons, providing authoritative reviews with emphasis on new developments relevant to their practice.