Automated Extraction of Stroke Severity From Unstructured Electronic Health Records Using Natural Language Processing.

IF 5.0 · CAS Tier 1 (Medicine) · JCR Q1, Cardiac & Cardiovascular Systems
Journal of the American Heart Association · Pub Date: 2024-11-05 · Epub Date: 2024-10-25 · DOI: 10.1161/JAHA.124.036386
Marta Fernandes, M Brandon Westover, Aneesh B Singhal, Sahar F Zafar
{"title":"Automated Extraction of Stroke Severity From Unstructured Electronic Health Records Using Natural Language Processing.","authors":"Marta Fernandes, M Brandon Westover, Aneesh B Singhal, Sahar F Zafar","doi":"10.1161/JAHA.124.036386","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Multicenter electronic health records can support quality improvement and comparative effectiveness research in stroke. However, limitations of electronic health record-based research include challenges in abstracting key clinical variables, including stroke severity, along with missing data. We developed a natural language processing model that reads electronic health record notes to directly extract the National Institutes of Health Stroke Scale score when documented and predict the score from clinical documentation when missing.</p><p><strong>Methods and results: </strong>The study included notes from patients with acute stroke (aged ≥18 years) admitted to Massachusetts General Hospital (2015-2022). The Massachusetts General Hospital data were divided into training/holdout test (70%/30%) sets. We developed a 2-stage model to predict the admission National Institutes of Health Stroke Scale, obtained from the GWTG (Get With The Guidelines) stroke registry. We trained a model with the least absolute shrinkage and selection operator. For test notes with documented National Institutes of Health Stroke Scale, scores were extracted using regular expressions (stage 1); when not documented, least absolute shrinkage and selection operator was used for prediction (stage 2). The 2-stage model was tested on the holdout test set and validated in the Medical Information Mart for Intensive Care (2001-2012) version 1.4, using root mean squared error and Spearman correlation. We included 4163 patients (Massachusetts General Hospital, 3876; Medical Information Mart for Intensive Care, 287); average age, 69 (SD, 15) years; 53% men, and 72% White individuals. The model achieved a root mean squared error of 2.89 (95% CI, 2.62-3.19) and Spearman correlation of 0.92 (95% CI, 0.91-0.93) in the Massachusetts General Hospital test set, and 2.20 (95% CI, 1.69-2.66) and 0.96 (95% CI, 0.94-0.97) in the MIMIC validation set, respectively.</p><p><strong>Conclusions: </strong>The automatic natural language processing-based model can enable large-scale stroke severity phenotyping from the electronic health record and support real-world quality improvement and comparative effectiveness studies in stroke.</p>","PeriodicalId":54370,"journal":{"name":"Journal of the American Heart Association","volume":" ","pages":"e036386"},"PeriodicalIF":5.0000,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the American Heart Association","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1161/JAHA.124.036386","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/10/25 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"CARDIAC & CARDIOVASCULAR SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Background: Multicenter electronic health records can support quality improvement and comparative effectiveness research in stroke. However, limitations of electronic health record-based research include challenges in abstracting key clinical variables, including stroke severity, along with missing data. We developed a natural language processing model that reads electronic health record notes to directly extract the National Institutes of Health Stroke Scale score when documented and predict the score from clinical documentation when missing.

Methods and results: The study included notes from patients with acute stroke (aged ≥18 years) admitted to Massachusetts General Hospital (2015-2022). The Massachusetts General Hospital data were divided into training/holdout test (70%/30%) sets. We developed a 2-stage model to predict the admission National Institutes of Health Stroke Scale score, obtained from the GWTG (Get With The Guidelines) stroke registry. We trained a model with the least absolute shrinkage and selection operator (LASSO). For test notes with a documented National Institutes of Health Stroke Scale score, scores were extracted using regular expressions (stage 1); when not documented, the LASSO model was used for prediction (stage 2). The 2-stage model was tested on the holdout test set and validated in the Medical Information Mart for Intensive Care (MIMIC, 2001-2012) version 1.4, using root mean squared error and Spearman correlation. We included 4163 patients (Massachusetts General Hospital, 3876; MIMIC, 287); average age, 69 (SD, 15) years; 53% men, and 72% White individuals. The model achieved a root mean squared error of 2.89 (95% CI, 2.62-3.19) and Spearman correlation of 0.92 (95% CI, 0.91-0.93) in the Massachusetts General Hospital test set, and 2.20 (95% CI, 1.69-2.66) and 0.96 (95% CI, 0.94-0.97) in the MIMIC validation set, respectively.
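The abstract gives only a high-level description of the two stages. A minimal sketch of such a pipeline, assuming a simple regex pattern, TF-IDF bag-of-words features, and an arbitrary LASSO penalty (none of which are taken from the paper), might look like this in Python:

```python
import re

import numpy as np
from scipy.stats import spearmanr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline

# Stage 1: extract a documented NIHSS score from free text with a regular
# expression. This pattern is an illustrative assumption, not the authors' rule set.
NIHSS_PATTERN = re.compile(r"\bNIHSS\b[^0-9]{0,20}(\d{1,2})", re.IGNORECASE)


def extract_nihss(note: str):
    """Return the first plausible documented NIHSS score (0-42), or None."""
    for match in NIHSS_PATTERN.finditer(note):
        score = int(match.group(1))
        if 0 <= score <= 42:
            return score
    return None


# Stage 2: when no score is documented, predict it from the note text with a
# LASSO (L1-penalized) linear model. The TF-IDF features and alpha are assumptions.
lasso_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=5),
    Lasso(alpha=0.01),
)
# lasso_model.fit(train_notes, train_nihss)  # hypothetical: fit on notes paired with registry (GWTG) scores


def two_stage_nihss(notes, model):
    """Stage 1 (regex) first; fall back to stage 2 (LASSO) when the score is missing."""
    predictions = []
    for note in notes:
        documented = extract_nihss(note)
        if documented is not None:
            predictions.append(float(documented))
        else:
            predictions.append(float(model.predict([note])[0]))
    return np.array(predictions)


def evaluate(y_true, y_pred):
    """Metrics reported in the abstract: root mean squared error and Spearman correlation."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    rho, _ = spearmanr(y_true, y_pred)
    return rmse, rho
```

In this sketch the LASSO stage is fit beforehand on training notes paired with registry admission NIHSS labels; the regex stage handles notes where the score is explicitly documented, and the model fills in the remainder, mirroring the study's fallback design.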

Conclusions: The automatic natural language processing-based model can enable large-scale stroke severity phenotyping from the electronic health record and support real-world quality improvement and comparative effectiveness studies in stroke.

Source journal
Journal of the American Heart Association
CiteScore: 9.40
Self-citation rate: 1.90%
Articles published per year: 1749
Review time: 12 weeks
Journal description: JAHA - Journal of the American Heart Association is an authoritative, peer-reviewed Open Access journal focusing on cardiovascular and cerebrovascular disease. It provides a global forum for basic and clinical research and timely reviews on cardiovascular disease and stroke. As an Open Access journal, its content is free on publication to read, download, and share, accelerating the translation of strong science into effective practice.