Comparative analysis, applications, and interpretation of electronic health record-based stroke phenotyping methods.

Impact factor: 4.0 | Zone 3 (Biology) | JCR Q1: Mathematical & Computational Biology

BioData Mining | Published: 2020-12-07 | DOI: 10.1186/s13040-020-00230-x

Phyllis M Thangaraj, Benjamin R Kummer, Tal Lorberbaum, Mitchell S V Elkind, Nicholas P Tatonetti

BioData Mining 13(1):21 (journal article). Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7720570/pdf/

Citations: 0

Abstract

Background: Accurate identification of acute ischemic stroke (AIS) patient cohorts is essential for a wide range of clinical investigations. Automated phenotyping methods that leverage electronic health records (EHRs) represent a fundamentally new approach to cohort identification, one that avoids the current laborious and ungeneralizable generation of phenotyping algorithms. We systematically compared and evaluated the ability of machine learning algorithms and case-control combinations to phenotype AIS patients using data from an EHR.

Materials and methods: Using structured patient data from the EHR at a tertiary-care hospital system, we built and evaluated machine learning models to identify patients with AIS based on 75 different case-control and classifier combinations. We then estimated the prevalence of AIS patients across the EHR. Finally, we externally validated the ability of the models to detect AIS patients without AIS diagnosis codes using the UK Biobank.
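
The evaluation over 75 case-control and classifier combinations can be pictured as a full grid of experiment specifications. The sketch below is only an illustration: the abstract does not say how the 75 combinations break down, so the 3 x 5 x 5 split and every label in it are hypothetical placeholders, not the authors' definitions.

```python
from itertools import product

# Hypothetical placeholder labels. The abstract gives only the total (75);
# this 3 x 5 x 5 split is an assumption chosen so the grid has 75 entries.
CASE_DEFINITIONS = ["cases_A", "cases_B", "cases_C"]
CONTROL_DEFINITIONS = [
    "controls_A", "controls_B", "controls_C", "controls_D", "controls_E",
]
CLASSIFIERS = ["clf_1", "clf_2", "clf_3", "clf_4", "clf_5"]


def build_experiment_grid(cases, controls, classifiers):
    """Return one experiment spec per (case, control, classifier) triple."""
    return [
        {"cases": case, "controls": control, "classifier": clf}
        for case, control, clf in product(cases, controls, classifiers)
    ]


grid = build_experiment_grid(CASE_DEFINITIONS, CONTROL_DEFINITIONS, CLASSIFIERS)
print(len(grid))  # 75
```

Each entry would then drive one train-and-evaluate run, so that metrics can be compared across case definitions, control definitions, and classifiers.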

Results: Across all models, the mean AUROC for detecting AIS was 0.963 ± 0.0520 and the mean average precision score was 0.790 ± 0.196, with minimal feature processing. Classifiers trained on cases with AIS diagnosis codes and controls with no cerebrovascular disease codes had the best average F1 score (0.832 ± 0.0383). In the external validation, the top probabilities from a model-predicted AIS cohort were significantly enriched for AIS patients without AIS diagnosis codes (60- to 150-fold over expected).
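
To make the fold-enrichment figure concrete, here is a minimal sketch of one way such a ratio can be computed. It assumes that "expected" means the overall rate of true cases in the scored population; the abstract does not define the baseline, so that choice, the function name, and the toy data are all illustrative assumptions.

```python
def fold_enrichment(scores, is_true_case, top_n):
    """Case rate among the top_n highest-scored patients divided by the
    overall case rate (the assumed 'expected' baseline)."""
    if len(scores) != len(is_true_case):
        raise ValueError("scores and labels must have the same length")
    if not 0 < top_n <= len(scores):
        raise ValueError("top_n must be between 1 and the number of patients")
    overall_rate = sum(is_true_case) / len(is_true_case)
    if overall_rate == 0:
        raise ValueError("no true cases, so enrichment is undefined")
    # Rank patients from highest to lowest predicted probability.
    ranked = sorted(zip(scores, is_true_case), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    return top_rate / overall_rate


# Toy data: 1000 patients, 10 true cases, 5 of them among the 10 top scores.
toy_scores = [i / 1000 for i in range(1000)]
toy_labels = [1 if (i < 5 or i >= 995) else 0 for i in range(1000)]
enrichment = fold_enrichment(toy_scores, toy_labels, top_n=10)
print(round(enrichment, 1))  # 50.0
```

In the toy data, half of the top 10 patients are true cases against a 1% overall rate, giving a 50-fold enrichment.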

Conclusions: Our findings support machine learning algorithms as a generalizable way to accurately identify AIS patients without using process-intensive manual feature curation. When a set of AIS patients is unavailable, diagnosis codes may be used to train classifier models.

Source journal

BioData Mining (Mathematical & Computational Biology)

CiteScore: 7.90
Self-citation rate: 0.00%
Articles published: (no value given)
Review time: 23 weeks

About the journal: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:
- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development, and integration of databases, software, and web services for the storage, management, retrieval, and analysis of data from large-scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.