Can a Large Language Model Interpret Data in the Electronic Health Record to Infer Minimum Clinically Important Difference Achievement of Knee Osteoarthritis Outcome Score-Joint Replacement Score Following Total Knee Arthroplasty?

Impact Factor 3.4 · CAS Region 2 (Medicine) · JCR Q1 (Orthopedics)
Abdul Zalikha, Thomas S Hong, Easton Small, Michael Constant, Alex H S Harris, Nicholas J Giori
Journal of Arthroplasty, published online 2025-03-24. DOI: 10.1016/j.arth.2025.03.049

Abstract

Background: Obtaining total knee arthroplasty (TKA) patient-reported outcomes for quality assessment is costly and difficult. We asked whether a large language model (LLM) could interpret electronic health record (EHR) notes to differentiate patients attaining a one-year minimum clinically important difference (MCID) for the Knee Osteoarthritis Outcome Score-Joint Replacement (KOOS-JR) from those who did not. We also investigated whether sufficient information to infer MCID achievement exists in the chart by having a blinded orthopaedic surgeon make the same determination.

Methods: In this retrospective case-control study, we selected 40 TKA patients who achieved 1-year KOOS-JR MCID and 40 who did not. Orthopaedic, emergency medicine, and primary care notes from zero to six months preoperatively and nine to 15 months postoperatively were deidentified. ChatGPT 3.5 interpreted these notes to determine whether the patient improved after surgery. A blinded orthopaedic surgeon classified these patients using all chart information. The sensitivity, specificity, and accuracy of ChatGPT 3.5 and the surgeon's responses were calculated.

Results: ChatGPT 3.5 classified 78 of 80 cases, with 97% sensitivity but only 33% specificity. The surgeon's assessment had 90% sensitivity and 63% specificity. Given the equal distribution of patients meeting or not meeting the MCID, ChatGPT's accuracy was 65%; the surgeon's was 76%.
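The reported accuracies follow from the sensitivity and specificity figures and the balanced 40/40 design. A minimal sketch of that arithmetic, using confusion-matrix counts inferred from the abstract (the exact counts are assumptions, not data from the paper):

```python
def metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) as fractions,
    given true/false positive and negative counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# ChatGPT 3.5: roughly 39 of 40 MCID-achievers and 13 of 40
# non-achievers correctly classified (inferred, not reported counts).
sens, spec, acc = metrics(tp=39, fn=1, tn=13, fp=27)
print(f"LLM:     sensitivity={sens:.1%}, specificity={spec:.1%}, accuracy={acc:.1%}")

# Surgeon: roughly 36 of 40 achievers and 25 of 40 non-achievers correct.
sens, spec, acc = metrics(tp=36, fn=4, tn=25, fp=15)
print(f"Surgeon: sensitivity={sens:.1%}, specificity={spec:.1%}, accuracy={acc:.1%}")
```

With these assumed counts the computed values match the abstract's 97%/33%/65% and 90%/63%/76% after rounding, which also shows why a balanced case-control sample makes accuracy simply the mean of sensitivity and specificity.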

Discussion: ChatGPT's assessment of KOOS-JR MCID attainment had 97% sensitivity, but only 33% specificity. False positives were commonly due to the LLM not having access to, or not properly interpreting, signs of problems in the chart. This was an initial evaluation of the current ability of a general-purpose LLM to evaluate patient outcomes based on information in chart notes. An orthopaedic surgeon's assessment of the full chart suggests an opportunity to improve on this baseline performance, possibly enabling quality monitoring and identification of best practices across a large health care system. Additional work is needed to optimize model performance and confirm the utility of this approach.

Source journal: Journal of Arthroplasty (Medicine, Orthopedics). CiteScore: 7.00. Self-citation rate: 20.00%. Articles per year: 734. Average review time: 48 days.
Journal description: The Journal of Arthroplasty brings together the clinical and scientific foundations for joint replacement. This peer-reviewed journal publishes original research and manuscripts of the highest quality from all areas relating to joint replacement or the treatment of its complications, including those dealing with clinical series and experience, prosthetic design, biomechanics, biomaterials, metallurgy, and biologic response to arthroplasty materials in vivo and in vitro.