ChatGPT is a comprehensive education tool for patients with patellar tendinopathy, but it currently lacks accuracy and readability
Jie Deng, Lun Li, Jelle J. Oosterhof, Peter Malliaras, Karin Grävare Silbernagel, Stephan J. Breda, Denise Eygendaal, Edwin HG. Oei, Robert-Jan de Vos
Musculoskeletal Science and Practice, volume 76, Article 103275. Published 2025-01-31. DOI: 10.1016/j.msksp.2025.103275
https://www.sciencedirect.com/science/article/pii/S2468781225000232
Journal impact factor: 2.2 (JCR Q1, Rehabilitation)
Citations: 0
Abstract
Background
Generative artificial intelligence tools, such as ChatGPT, are becoming increasingly integrated into daily life, and patients might turn to this tool to seek medical information.
Objective
To evaluate the performance of ChatGPT-4 in responding to patient-centered queries for patellar tendinopathy (PT).
Methods
Forty-eight patient-centered queries were collected from online sources, PT patients, and experts and were then submitted to ChatGPT-4. Three board-certified experts independently assessed the accuracy and comprehensiveness of the responses. Readability was measured using the Flesch-Kincaid Grade Level (FKGL; higher scores indicate a higher grade reading level). The Patient Education Materials Assessment Tool (PEMAT) evaluated understandability and actionability (0–100%; higher scores indicate information with clearer messages and more identifiable actions). Semantic Textual Similarity (STS score, 0–1; higher scores indicate higher similarity) assessed variation in the meaning of texts over two months (including ChatGPT-4o) and for different terminologies related to PT.
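For readers unfamiliar with the FKGL, the standard formula is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The following is a minimal Python sketch of that formula, not the tool the authors used, which the abstract does not name. The helper names `count_syllables` and `fkgl` are hypothetical, and the syllable counter is a rough vowel-group heuristic, so scores may differ slightly from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not a dictionary lookup):
    count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level using the standard coefficients."""
    # Split on runs of sentence-ending punctuation; ignore empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


if __name__ == "__main__":
    sample = "Patellar tendinopathy is an overuse injury of the knee."
    print(f"FKGL: {fkgl(sample):.1f}")
```

Under this formula, longer sentences and more polysyllabic words both raise the grade level.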
Results
Sixteen (33%) of the 48 responses were rated accurate, while 36 (75%) were rated comprehensive. Only 17% of treatment-related questions received accurate responses. Most responses were written at a college reading level (median and interquartile range [IQR] of FKGL score: 15.4 [14.4–16.6]). The median of PEMAT for understandability was 83% (IQR: 70%–92%), and for actionability, it was 60% (IQR: 40%–60%). The medians of STS scores in the meaning of texts over two months and across terminologies were all ≥ 0.9.
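The "median [IQR]" summaries above can be reproduced from per-response scores along the following lines. This is a sketch only: the function name `median_iqr` is hypothetical, the example scores are placeholders rather than study data, and the abstract does not state which quartile convention the authors used (the "inclusive" method here is an assumption).

```python
from statistics import median, quantiles


def median_iqr(scores):
    """Return (median, Q1, Q3) for a list of numeric scores.

    Quartiles use the 'inclusive' method of statistics.quantiles;
    other conventions can give slightly different Q1/Q3 values.
    """
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), q1, q3


if __name__ == "__main__":
    # Placeholder values for illustration only, not data from the study.
    example_scores = [13.9, 14.4, 15.1, 15.4, 16.0, 16.6, 17.2]
    med, q1, q3 = median_iqr(example_scores)
    print(f"median {med} [IQR {q1}-{q3}]")
```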
Conclusions
ChatGPT-4 provided generally comprehensive information in response to patient-centered queries but lacked accuracy and was difficult to read for individuals below a college reading level.
About the journal:
Musculoskeletal Science & Practice, the international journal of musculoskeletal physiotherapy, is a peer-reviewed international journal (previously Manual Therapy) that publishes high-quality original research, review, and Masterclass articles that contribute to improving the clinical understanding of appropriate care processes for musculoskeletal disorders. The journal publishes articles that influence or add to the body of evidence on diagnostic and therapeutic processes, patient-centered care, guidelines for musculoskeletal therapeutics, and theoretical models that support developments in assessment, diagnosis, clinical reasoning, and interventions.