Can a Large Language Model Interpret Data in the Electronic Health Record to Infer Minimum Clinically Important Difference Achievement of Knee Osteoarthritis Outcome Score-Joint Replacement Score Following Total Knee Arthroplasty?
Abdul Zalikha, Thomas S Hong, Easton Small, Michael Constant, Alex H S Harris, Nicholas J Giori
Journal of Arthroplasty (Q1, Orthopedics; impact factor 3.4). Published online 2025-03-24. DOI: 10.1016/j.arth.2025.03.049
Citations: 0
Abstract
Background: Obtaining total knee arthroplasty (TKA) patient-reported outcomes for quality assessment is costly and difficult. We asked whether a large language model (LLM) could interpret electronic health record (EHR) notes to differentiate patients attaining a one-year minimum clinically important difference (MCID) for the Knee Osteoarthritis Outcome Score-Joint Replacement (KOOS-JR) from those who did not. We also investigated whether sufficient information to infer MCID achievement exists in the chart by having a blinded orthopaedic surgeon make the same determination.
Methods: In this retrospective case-control study, we selected 40 TKA patients who achieved the one-year KOOS-JR MCID and 40 who did not. Orthopaedic, emergency medicine, and primary care notes from zero to six months preoperatively and nine to 15 months postoperatively were deidentified. ChatGPT 3.5 interpreted these notes to determine whether the patient improved after surgery. A blinded orthopaedic surgeon classified these patients using all chart information. The sensitivity, specificity, and accuracy of ChatGPT 3.5 and the surgeon's responses were calculated.
Results: ChatGPT 3.5 returned a classification for 78 of the 80 cases, achieving 97% sensitivity but only 33% specificity. The surgeon's assessment had 90% sensitivity and 63% specificity. Given the equal distribution of patients meeting or not meeting the MCID, ChatGPT's accuracy was 65%. The surgeon's was 76%.
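The relationship among the reported figures can be checked with standard confusion-matrix arithmetic. The abstract does not give the individual cell counts, so the counts below are a plausible reconstruction, not data from the study: they assume ChatGPT 3.5 classified 39 MCID-achievers and 39 non-achievers (2 of 80 cases unclassified), while the surgeon classified all 80.

```python
def metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as percentages.

    Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP);
    accuracy = (TP + TN) / all classified cases.
    """
    sensitivity = 100 * tp / (tp + fn)
    specificity = 100 * tn / (tn + fp)
    accuracy = 100 * (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Hypothetical ChatGPT 3.5 counts consistent with the reported figures:
# 38/39 ≈ 97% sensitivity, 13/39 ≈ 33% specificity, 51/78 ≈ 65% accuracy.
print(metrics(tp=38, fn=1, tn=13, fp=26))

# Hypothetical surgeon counts: 36/40 = 90% sensitivity,
# 25/40 = 62.5% specificity (reported as 63%), 61/80 ≈ 76% accuracy.
print(metrics(tp=36, fn=4, tn=25, fp=15))
```

Note how the imbalanced error profile plays out: with equal numbers of achievers and non-achievers, very high sensitivity cannot compensate for low specificity, which is why the model's overall accuracy lands near 65% despite catching nearly every true improver.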
Discussion: ChatGPT's assessment of KOOS-JR MCID attainment had 97% sensitivity, but only 33% specificity. False positives were commonly due to the LLM not having access to, or not properly interpreting, signs of problems in the chart. This was an initial evaluation of the current ability of a general-purpose LLM to evaluate patient outcomes based on information in chart notes. An orthopaedic surgeon's assessment of the full chart suggests an opportunity to improve on this baseline performance, possibly enabling quality monitoring and identification of best practices across a large health care system. Additional work is needed to optimize model performance and confirm the utility of this approach.
Journal Description
The Journal of Arthroplasty brings together the clinical and scientific foundations for joint replacement. This peer-reviewed journal publishes original research and manuscripts of the highest quality from all areas relating to joint replacement or the treatment of its complications, including those dealing with clinical series and experience, prosthetic design, biomechanics, biomaterials, metallurgy, biologic response to arthroplasty materials in vivo and in vitro.