{"title":"Generating Synthetic Electronic Health Record Data Using Generative Adversarial Networks: Tutorial.","authors":"Chao Yan, Ziqi Zhang, Steve Nyemba, Zhuohang Li","doi":"10.2196/52615","DOIUrl":null,"url":null,"abstract":"<p><p>Synthetic electronic health record (EHR) data generation has been increasingly recognized as an important solution to expand the accessibility and maximize the value of private health data on a large scale. Recent advances in machine learning have facilitated more accurate modeling for complex and high-dimensional data, thereby greatly enhancing the data quality of synthetic EHR data. Among various approaches, generative adversarial networks (GANs) have become the main technical path in the literature due to their ability to capture the statistical characteristics of real data. However, there is a scarcity of detailed guidance within the domain regarding the development procedures of synthetic EHR data. The objective of this tutorial is to present a transparent and reproducible process for generating structured synthetic EHR data using a publicly accessible EHR data set as an example. We cover the topics of GAN architecture, EHR data types and representation, data preprocessing, GAN training, synthetic data generation and postprocessing, and data quality evaluation. We conclude this tutorial by discussing multiple important issues and future opportunities in this domain. The source code of the entire process has been made publicly available.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"3 ","pages":"e52615"},"PeriodicalIF":0.0000,"publicationDate":"2024-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11074891/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR AI","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/52615","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Synthetic electronic health record (EHR) data generation has been increasingly recognized as an important solution to expand the accessibility and maximize the value of private health data on a large scale. Recent advances in machine learning have facilitated more accurate modeling for complex and high-dimensional data, thereby greatly enhancing the data quality of synthetic EHR data. Among various approaches, generative adversarial networks (GANs) have become the main technical path in the literature due to their ability to capture the statistical characteristics of real data. However, there is a scarcity of detailed guidance within the domain regarding the development procedures of synthetic EHR data. The objective of this tutorial is to present a transparent and reproducible process for generating structured synthetic EHR data using a publicly accessible EHR data set as an example. We cover the topics of GAN architecture, EHR data types and representation, data preprocessing, GAN training, synthetic data generation and postprocessing, and data quality evaluation. We conclude this tutorial by discussing multiple important issues and future opportunities in this domain. The source code of the entire process has been made publicly available.
合成电子健康记录(EHR)数据生成已被越来越多的人认为是扩大私人健康数据的可访问性并最大限度地提高其价值的重要解决方案。机器学习的最新进展促进了对复杂和高维数据进行更精确的建模,从而大大提高了合成电子病历数据的质量。在各种方法中,生成对抗网络(GANs)因其能够捕捉真实数据的统计特征而成为文献中的主要技术路径。然而,在该领域中,有关合成电子病历数据开发程序的详细指导却十分匮乏。本教程的目的是以公开的电子病历数据集为例,介绍生成结构化合成电子病历数据的透明且可重复的流程。我们将讨论 GAN 架构、电子病历数据类型和表示、数据预处理、GAN 训练、合成数据生成和后处理以及数据质量评估等主题。最后,我们将讨论该领域的多个重要问题和未来机遇。整个过程的源代码已经公开。