Identifying Deprescribing Opportunities With Large Language Models in Older Adults: Retrospective Cohort Study.

Impact factor: 5.0 · Q1, Geriatrics & Gerontology
JMIR Aging · Published 2025-04-11 · DOI: 10.2196/69504
Vimig Socrates, Donald S Wright, Thomas Huang, Soraya Fereydooni, Christine Dien, Ling Chi, Jesse Albano, Brian Patterson, Naga Sasidhar Kanaparthy, Catherine X Wright, Andrew Loza, David Chartash, Mark Iscoe, Richard Andrew Taylor
JMIR Aging. 2025;8:e69504. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12032504/pdf/. Citations: 0

Abstract

Background: Polypharmacy, the concurrent use of multiple medications, is prevalent among older adults and associated with increased risks for adverse drug events including falls. Deprescribing, the systematic process of discontinuing potentially inappropriate medications, aims to mitigate these risks. However, the practical application of deprescribing criteria in emergency settings remains limited due to time constraints and criteria complexity.

Objective: This study aims to evaluate the performance of a large language model (LLM)-based pipeline in identifying deprescribing opportunities for older emergency department (ED) patients with polypharmacy, using 3 different sets of criteria: Beers, Screening Tool of Older People's Prescriptions, and Geriatric Emergency Medication Safety Recommendations. The study further evaluates LLM confidence calibration and its ability to improve recommendation performance.

Methods: We conducted a retrospective cohort study of older adults presenting to an ED in a large academic medical center in the Northeast United States from January 2022 to March 2022. A random sample of 100 patients (712 total oral medications) was selected for detailed analysis. The LLM pipeline consisted of two steps: (1) filtering high-yield deprescribing criteria based on patients' medication lists, and (2) applying these criteria using both structured and unstructured patient data to recommend deprescribing. Model performance was assessed by comparing model recommendations to those of trained medical students, with discrepancies adjudicated by board-certified ED physicians. Selective prediction, a method that allows a model to abstain from low-confidence predictions to improve overall reliability, was applied to assess the model's confidence and decision-making thresholds.
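The two-step pipeline described above can be sketched as follows. The criteria, rule texts, and medication names are invented illustrations (not actual Beers, STOPP, or GEMS-Rx content), and `apply_criterion` merely stubs out where the LLM call would occur:

```python
# Hypothetical sketch of the two-step pipeline: (1) filter criteria by the
# patient's medication list, (2) apply each surviving criterion to the record.
# All criteria and drug names are illustrative, not real guideline content.

CRITERIA = [
    {"id": "C1", "medications": {"diphenhydramine"},
     "rule": "avoid first-generation antihistamines in adults 65+"},
    {"id": "C2", "medications": {"glyburide"},
     "rule": "avoid long-acting sulfonylureas"},
    {"id": "C3", "medications": {"warfarin", "aspirin"},
     "rule": "review combined anticoagulant/antiplatelet use"},
]

def filter_high_yield(med_list):
    """Step 1: keep only criteria that mention a drug the patient takes."""
    meds = {m.lower() for m in med_list}
    return [c for c in CRITERIA if c["medications"] & meds]

def apply_criterion(criterion, patient_record):
    """Step 2 (stub): where the LLM would judge, from structured and
    unstructured data, whether the criterion applies to this patient."""
    prompt = (f"Patient record: {patient_record}\n"
              f"Criterion: {criterion['rule']}\n"
              "Does this criterion support deprescribing? Answer yes/no.")
    return {"criterion": criterion["id"], "prompt": prompt}

def run_pipeline(med_list, patient_record):
    return [apply_criterion(c, patient_record)
            for c in filter_high_yield(med_list)]

recs = run_pipeline(["Diphenhydramine", "Metformin"], "82F, recent fall")
print([r["criterion"] for r in recs])  # only C1 survives the filter
```

Filtering first keeps the second, expensive step (the LLM call) restricted to criteria that could plausibly apply, which is what makes the approach tractable at ED time scales.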

Results: The LLM was significantly more effective in identifying deprescribing criteria (positive predictive value: 0.83; negative predictive value: 0.93; McNemar test for paired proportions: χ²₁=5.985; P=.02) relative to medical students, but showed limitations in making specific deprescribing recommendations (positive predictive value=0.47; negative predictive value=0.93). Adjudication revealed that while the model excelled at identifying when there was a deprescribing criterion related to one of the patient's medications, it often struggled with determining whether that criterion applied to the specific case due to complex inclusion and exclusion criteria (54.5% of errors) and ambiguous clinical contexts (eg, missing information; 39.3% of errors). Selective prediction only marginally improved LLM performance due to poorly calibrated confidence estimates.
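The comparison above rests on positive/negative predictive value and a McNemar test on discordant pairs. A minimal sketch with made-up confusion counts (chosen only to illustrate the arithmetic; they are not the study's data):

```python
# Illustrative computation of PPV/NPV and the McNemar statistic.
# All counts below are hypothetical, not taken from the study.

def ppv_npv(tp, fp, tn, fn):
    """Positive and negative predictive value from a confusion matrix."""
    return tp / (tp + fp), tn / (tn + fn)

def mcnemar_chi2(b, c):
    """McNemar chi-square on the discordant pairs: b = cases one rater got
    right and the other wrong, c = the reverse (df = 1, no continuity
    correction)."""
    return (b - c) ** 2 / (b + c)

# Hypothetical counts for a criterion-identification task.
ppv, npv = ppv_npv(tp=50, fp=10, tn=40, fn=3)
print(round(ppv, 2), round(npv, 2))  # 0.83 0.93

# Hypothetical discordant pairs between LLM and medical students.
print(round(mcnemar_chi2(b=16, c=5), 3))
```

The McNemar test looks only at the discordant pairs, which is why it is the right paired test here: concordant cases, where the LLM and the students agree, carry no information about which rater is better.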

Conclusions: This study highlights the potential of LLMs to support deprescribing decisions in the ED by effectively filtering relevant criteria. However, challenges remain in applying these criteria to complex clinical scenarios, as the LLM demonstrated poor performance on more intricate decision-making tasks, with its reported confidence often failing to align with its actual success in these cases. The findings underscore the need for clearer deprescribing guidelines, improved LLM calibration for real-world use, and better integration of human-artificial intelligence workflows to balance artificial intelligence recommendations with clinician judgment.
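Selective prediction, and why calibration matters for it, can be illustrated with toy numbers: the model abstains below a confidence threshold, trading coverage for accuracy, a trade that only pays off when confidence tracks correctness. All predictions and confidence values below are fabricated for illustration:

```python
# Toy sketch of selective prediction: abstain on low-confidence predictions
# and report accuracy on the retained subset plus coverage.
# Predictions, labels, and confidences are fabricated.

def selective_metrics(preds, labels, confs, threshold):
    """Accuracy on retained (high-confidence) predictions, and coverage."""
    kept = [(p, y) for p, y, c in zip(preds, labels, confs) if c >= threshold]
    if not kept:
        return 0.0, 0.0
    acc = sum(p == y for p, y in kept) / len(kept)
    coverage = len(kept) / len(preds)
    return acc, coverage

preds  = [1, 0, 1, 1, 0, 1]
labels = [1, 0, 0, 1, 0, 0]
confs  = [0.95, 0.9, 0.55, 0.8, 0.85, 0.5]

# With well-calibrated confidence, raising the threshold raises accuracy on
# the retained cases at the cost of coverage; the study found only marginal
# gains because the LLM's confidence was poorly calibrated.
acc_all, cov_all = selective_metrics(preds, labels, confs, 0.0)
acc_sel, cov_sel = selective_metrics(preds, labels, confs, 0.7)
print(acc_all, cov_all)  # accuracy 4/6, coverage 1.0
print(acc_sel, cov_sel)  # accuracy 1.0, coverage 4/6
```

In this toy example abstention works because the two errors happen to carry the lowest confidence; when confidence is miscalibrated, as the study reports, abstention discards correct and incorrect predictions alike and the accuracy gain evaporates.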
