Cerebrovascular disease case identification in inpatient electronic medical record data using natural language processing.

Impact factor: 4.5 | JCR: Q1 | Category: Computer Science
Jie Pan, Zilong Zhang, Steven Ray Peters, Shabnam Vatanpour, Robin L Walker, Seungwon Lee, Elliot A Martin, Hude Quan
Journal: Brain Informatics, volume 10, issue 1, p. 22
Publication date: 2023-09-02
Publication type: Journal Article
DOI: https://doi.org/10.1186/s40708-023-00203-w
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10474977/pdf/
Citations: 0

Abstract


Cerebrovascular disease case identification in inpatient electronic medical record data using natural language processing.


Background: Abstracting cerebrovascular disease (CeVD) from inpatient electronic medical records (EMRs) through natural language processing (NLP) is pivotal for automated disease surveillance and improving patient outcomes. Existing methods rely on coders' abstraction, which has time delays and under-coding issues. This study sought to develop an NLP-based method to detect CeVD using EMR clinical notes.

Methods: CeVD status was confirmed through chart review of randomly selected hospitalized patients who were 18 years or older and were discharged from three hospitals in Calgary, Alberta, Canada, between January 1 and June 30, 2015. These patients' chart data were linked to the administrative Discharge Abstract Database (DAD) and Sunrise Clinical Manager (SCM) EMR database records by Personal Health Number (a unique lifetime identifier) and admission date. We trained multiple NLP predictive models by combining two clinical concept extraction methods with two supervised machine learning (ML) methods: random forest and XGBoost. Using chart review as the reference standard, we compared the models' performance with that of the commonly applied International Classification of Diseases (ICD-10-CA) codes on sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
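The four validity metrics named in the Methods can be illustrated with a minimal sketch. It assumes binary labels (1 = CeVD, 0 = no CeVD) and treats one list as the chart-review reference standard; the function name `validity` and the toy labels are hypothetical and are not taken from the study.

```python
import math
from dataclasses import dataclass


@dataclass
class Validity:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def _ratio(numerator, denominator):
    # Return NaN instead of raising when a metric is undefined.
    return numerator / denominator if denominator else math.nan


def validity(reference, predicted):
    """Compare binary predictions with a reference standard (1 = case, 0 = non-case)."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # true positives
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)  # true negatives
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false positives
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # false negatives
    return Validity(
        sensitivity=_ratio(tp, tp + fn),  # share of true cases that were detected
        specificity=_ratio(tn, tn + fp),  # share of non-cases correctly ruled out
        ppv=_ratio(tp, tp + fp),          # share of flagged records that are true cases
        npv=_ratio(tn, tn + fn),          # share of unflagged records that are true non-cases
    )


# Toy data only: ten hypothetical records, not study data.
chart_review = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
algorithm = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
result = validity(chart_review, algorithm)
print(result)
```

The same function could be applied once to ICD-based flags and once to model predictions, so that both are scored against the same reference labels.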

Results: Of the study sample (n = 3036), the prevalence of CeVD was 11.8% (n = 360); the median patient age was 63 years; and females accounted for 50.3% (n = 1528), based on chart data. Among 49 extracted clinical documents from the EMR, four document types were identified as the most influential text sources for identifying CeVD: "nursing transfer report," "discharge summary," "nursing notes," and "inpatient consultation." The best-performing NLP model was XGBoost, combining the Unified Medical Language System concepts extracted by cTAKES (e.g., the top-ranked concepts "Cerebrovascular accident" and "Transient ischemic attack") with the term frequency-inverse document frequency vectorizer. Compared with ICD codes, the model achieved higher validity overall (ICD codes vs. model): sensitivity 25.0% vs. 70.0%, specificity 99.3% vs. 99.1%, PPV 82.6% vs. 87.8%, and NPV 90.8% vs. 97.1%.
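To make the term frequency-inverse document frequency weighting concrete, here is a minimal sketch that assumes each patient record has already been reduced to a list of extracted concept names. It uses the textbook form tf × log(N / df); the abstract does not state which TF-IDF variant or library the authors used, and the function name `tfidf` and the toy concept lists are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy data only: hypothetical concept lists, not study data.
records = [
    ["cerebrovascular accident", "hypertension", "cerebrovascular accident"],
    ["transient ischemic attack", "hypertension"],
    ["pneumonia", "hypertension"],
]
weights = tfidf(records)
for row in weights:
    print(row)
```

A concept that appears in every record (here "hypertension") receives weight zero, while a concept concentrated in a few records is weighted up. Vectors of this kind are what a supervised classifier such as random forest or XGBoost would take as input features; that training step is not shown here.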

Conclusion: The NLP algorithm developed in this study performed better than the ICD code algorithm in detecting CeVD. The NLP models could lead to an automated EMR tool for identifying CeVD cases and could be applied in future work such as surveillance and longitudinal studies.

Source journal: Brain Informatics
Subject area: Computer Science - Computer Science Applications
CiteScore: 9.50
Self-citation rate: 0.00%
Articles published: 27
Review time: 13 weeks

Journal description: Brain Informatics is an international, peer-reviewed, interdisciplinary open-access journal published under the brand SpringerOpen, which provides a unique platform for researchers and practitioners to disseminate original research on computational and informatics technologies related to the brain. The journal addresses the computational, cognitive, physiological, biological, physical, ecological and social perspectives of brain informatics. It also welcomes emerging information technologies and advanced neuro-imaging technologies, such as big data analytics and interactive knowledge discovery related to various large-scale brain studies and their applications. The journal publishes high-quality original research papers, brief reports and critical reviews in all theoretical, technological, clinical and interdisciplinary studies that make up the field of brain informatics and its applications in brain-machine intelligence, brain-inspired intelligent systems, mental health and brain disorders, etc. The scope of papers includes the following five tracks:
Track 1: Cognitive and Computational Foundations of Brain Science
Track 2: Human Information Processing Systems
Track 3: Brain Big Data Analytics, Curation and Management
Track 4: Informatics Paradigms for Brain and Mental Health Research
Track 5: Brain-Machine Intelligence and Brain-Inspired Computing