Machine learning approaches for electronic health records phenotyping: A methodical review

Journal of the American Medical Informatics Association : JAMIA | Pub Date: 2022-04-27 | DOI: 10.1101/2022.04.23.22274218

Siyue Yang, Paul Varghese, E. Stephenson, K. Tu, J. Gronsbell

Citations: 17

Abstract

Objective: Accurate and rapid phenotyping methods are a prerequisite for realizing the potential of electronic health record (EHR) data in clinical and translational research. This study reviews the literature on machine learning (ML) approaches for phenotyping with respect to the phenotypes considered, the data sources and methods used, and the contributions within the wider context of EHR-based research.

Materials and Methods: We searched PubMed and Web of Science for relevant articles published between January 1, 2018 and April 14, 2022. After screening, we collected data on 52 variables across 106 selected articles.

Results: ML-based methods were developed for 156 unique phenotypes, primarily using EHR data from a single institution or health system. Of the 106 articles, 72 leveraged unstructured data in clinical notes. In terms of methodology, supervised learning was the most prevalent ML paradigm (n = 64, 60.4%), with half of the articles employing deep learning. Semi-supervised and weakly-supervised approaches were applied to reduce the burden of obtaining gold-standard labeled data (n = 21, 19.8%), while unsupervised learning was used for phenotype discovery (n = 20, 18.9%). Federated learning was applied to develop algorithms across multiple institutions while preserving data privacy (n = 2, 1.9%).

Discussion: While the use of ML for phenotyping is growing, most articles applied traditional supervised ML to characterize the presence of common, chronic conditions.

Conclusion: Continued research in ML-based methods is warranted, with particular attention to the development of advanced methods for complex phenotypes and to standards for reporting and evaluating phenotyping algorithms.
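The percentages in the Results can be checked against the stated counts. The short Python sketch below is not part of the article; it assumes every percentage uses the 106 selected articles as its denominator, and the variable and function names are illustrative only.

```python
# Counts and percentages as reported in the abstract.
# Assumption: each percentage is taken over the 106 selected articles.
TOTAL_ARTICLES = 106

reported = {
    "supervised": (64, 60.4),
    "semi-/weakly-supervised": (21, 19.8),
    "unsupervised": (20, 18.9),
    "federated": (2, 1.9),
}


def percent(count, total=TOTAL_ARTICLES):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


for name, (count, stated) in reported.items():
    print(f"{name}: {count}/{TOTAL_ARTICLES} = {percent(count)}% (abstract: {stated}%)")

# Share of articles using unstructured clinical notes (72 of 106).
notes_share = percent(72)
print(f"clinical notes: {notes_share}%")

# Sum of the per-paradigm counts, for comparison with the article total.
paradigm_total = sum(count for count, _ in reported.values())
print(f"sum of paradigm counts: {paradigm_total}")
```

Under that assumption the recomputed values match the reported ones, and 72 of 106 works out to about 67.9%. The paradigm counts sum to 107, one more than the 106 articles, which suggests the categories are not mutually exclusive. That reading is an inference from the arithmetic and is not stated in the abstract.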
