Rae Tarapore MD, Suhasini Gupta MD, Kenneth R. Means Jr MD, Aviram M. Giladi MD, MS
{"title":"人工智能可以回答桡骨远端骨折术后的问题——但患者能理解答案吗?","authors":"Rae Tarapore MD , Suhasini Gupta MD , Kenneth R. Means Jr MD , Aviram M. Giladi MD, MS","doi":"10.1016/j.jhsg.2025.100822","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><div>The purpose of this study was to assess the validity, reliability, and readability of responses to common patient questions about postoperative from ChatGPT, Microsoft Copilot, and Google Gemini.</div></div><div><h3>Methods</h3><div>Twenty-seven thoroughly vetted questions regarding distal radius fractures repair surgery were compiled and entered into ChatGPT 4, Gemini, and Copilot. The responses were analyzed for quality, accuracy, and readability using the DISCERN scale, the Journal of the American Medical Association benchmark criteria, Flesch-Kincaid Reading Ease Score, and Flesch-Kincaid Grade Level. Citations provided by Google Gemini and Microsoft Copilot were further categorized by source of reference. Five questions were resubmitted, requesting response simplification. The responses were re-evaluated using the same metrics.</div></div><div><h3>Results</h3><div>All three artificial intelligence platforms produced answers that were considered “good” quality (DISCERN scores >50). Copilot had the highest quality of information (68.3), followed by Gemini (62.9) and ChatGPT (52.9). The information provided by Copilot demonstrated the highest reliability, with a Journal of the American Medical Association benchmark criterion of 3 (of 4) compared with Gemini (1) and ChatGPT (0). All three platforms generated complex texts with Flesch-Kincaid Reading Ease Scores ranging between 35.8 and 41.4 and Flesch-Kincaid Grade Level scores between 10.5 and 12.1, indicating a minimum of high-school graduate reading level required. After simplification, Gemini’s reading level remained unchanged, whereas ChatGPT improved to that of a seventh-grade reading level and Copilot improved to that of an eighth-grade reading level. Copilot had a higher number of references (74) compared with Gemini (36).</div></div><div><h3>Conclusions</h3><div>All three platforms provided safe and reliable answers to postoperative questions about distal radius fractures. High reading levels provided by AI remain the biggest barrier to patient accessibility.</div></div><div><h3>Clinical relevance</h3><div>For the current state of mainstream AI platforms, they are best suited as adjunct tools to support, rather than replace, clinical communication from health care workers.</div></div>","PeriodicalId":36920,"journal":{"name":"Journal of Hand Surgery Global Online","volume":"7 6","pages":"Article 100822"},"PeriodicalIF":0.0000,"publicationDate":"2025-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Artificial Intelligence Can Answer Postoperative Questions About Distal Radius Fractures—But Can Patients Understand the Answers?\",\"authors\":\"Rae Tarapore MD , Suhasini Gupta MD , Kenneth R. Means Jr MD , Aviram M. 
Giladi MD, MS\",\"doi\":\"10.1016/j.jhsg.2025.100822\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Purpose</h3><div>The purpose of this study was to assess the validity, reliability, and readability of responses to common patient questions about postoperative from ChatGPT, Microsoft Copilot, and Google Gemini.</div></div><div><h3>Methods</h3><div>Twenty-seven thoroughly vetted questions regarding distal radius fractures repair surgery were compiled and entered into ChatGPT 4, Gemini, and Copilot. The responses were analyzed for quality, accuracy, and readability using the DISCERN scale, the Journal of the American Medical Association benchmark criteria, Flesch-Kincaid Reading Ease Score, and Flesch-Kincaid Grade Level. Citations provided by Google Gemini and Microsoft Copilot were further categorized by source of reference. Five questions were resubmitted, requesting response simplification. The responses were re-evaluated using the same metrics.</div></div><div><h3>Results</h3><div>All three artificial intelligence platforms produced answers that were considered “good” quality (DISCERN scores >50). Copilot had the highest quality of information (68.3), followed by Gemini (62.9) and ChatGPT (52.9). The information provided by Copilot demonstrated the highest reliability, with a Journal of the American Medical Association benchmark criterion of 3 (of 4) compared with Gemini (1) and ChatGPT (0). All three platforms generated complex texts with Flesch-Kincaid Reading Ease Scores ranging between 35.8 and 41.4 and Flesch-Kincaid Grade Level scores between 10.5 and 12.1, indicating a minimum of high-school graduate reading level required. After simplification, Gemini’s reading level remained unchanged, whereas ChatGPT improved to that of a seventh-grade reading level and Copilot improved to that of an eighth-grade reading level. Copilot had a higher number of references (74) compared with Gemini (36).</div></div><div><h3>Conclusions</h3><div>All three platforms provided safe and reliable answers to postoperative questions about distal radius fractures. 
High reading levels provided by AI remain the biggest barrier to patient accessibility.</div></div><div><h3>Clinical relevance</h3><div>For the current state of mainstream AI platforms, they are best suited as adjunct tools to support, rather than replace, clinical communication from health care workers.</div></div>\",\"PeriodicalId\":36920,\"journal\":{\"name\":\"Journal of Hand Surgery Global Online\",\"volume\":\"7 6\",\"pages\":\"Article 100822\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Hand Surgery Global Online\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2589514125001422\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"Medicine\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Hand Surgery Global Online","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2589514125001422","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Medicine","Score":null,"Total":0}
Artificial Intelligence Can Answer Postoperative Questions About Distal Radius Fractures—But Can Patients Understand the Answers?
Purpose
The purpose of this study was to assess the validity, reliability, and readability of responses from ChatGPT, Microsoft Copilot, and Google Gemini to common patient questions about postoperative care after distal radius fracture repair.
Methods
Twenty-seven thoroughly vetted questions regarding distal radius fracture repair surgery were compiled and entered into ChatGPT 4, Gemini, and Copilot. The responses were analyzed for quality, accuracy, and readability using the DISCERN scale, the Journal of the American Medical Association benchmark criteria, the Flesch-Kincaid Reading Ease Score, and the Flesch-Kincaid Grade Level. Citations provided by Google Gemini and Microsoft Copilot were further categorized by reference source. Five questions were resubmitted with a request to simplify the responses, and the simplified responses were re-evaluated using the same metrics.
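For context, the two Flesch-Kincaid metrics named above are simple functions of average sentence length and word length. Below is a minimal Python sketch of the standard published formulas; the naive vowel-group syllable counter and the sample sentence are illustrative assumptions, as the study does not specify which scoring tool was used.

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count runs of consecutive vowels; assume at least one syllable per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid(text: str) -> tuple[float, float]:
    # Simple regex tokenization; dedicated readability tools use more careful rules.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Standard published constants for the two metrics.
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level

if __name__ == "__main__":
    sample = ("After surgery, keep your splint clean and dry. "
              "Call the office if you notice increasing pain, numbness, or swelling.")
    ease, grade = flesch_kincaid(sample)
    print(f"Reading Ease: {ease:.1f}, Grade Level: {grade:.1f}")
```

Lower Reading Ease scores and higher Grade Level values both indicate more difficult text.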
Results
All three artificial intelligence platforms produced answers considered “good” quality (DISCERN scores >50). Copilot had the highest information quality (68.3), followed by Gemini (62.9) and ChatGPT (52.9). Copilot's information was also the most reliable, meeting 3 of 4 Journal of the American Medical Association benchmark criteria, compared with 1 for Gemini and 0 for ChatGPT. All three platforms generated complex text, with Flesch-Kincaid Reading Ease Scores between 35.8 and 41.4 and Flesch-Kincaid Grade Level scores between 10.5 and 12.1, indicating that at least a high-school graduate reading level is required. After simplification, Gemini's reading level was unchanged, whereas ChatGPT improved to a seventh-grade reading level and Copilot to an eighth-grade reading level. Copilot provided more references (74) than Gemini (36).
Conclusions
All three platforms provided safe and reliable answers to postoperative questions about distal radius fractures. The high reading level of AI-generated responses remains the biggest barrier to patient accessibility.
Clinical relevance
In their current state, mainstream AI platforms are best suited as adjunct tools to support, rather than replace, clinical communication from health care workers.