Ensuring electronic medical record simulation through better training, modeling, and evaluation

Journal of the American Medical Informatics Association (JAMIA). Pub Date: 2019-10-08. DOI: 10.1093/jamia/ocz161

Ziqi Zhang, Chao Yan, Diego A. Mesa, Jimeng Sun, B. Malin


Citations: 47

Abstract


Ensuring electronic medical record simulation through better training, modeling, and evaluation

OBJECTIVE
Electronic medical records (EMRs) can support medical research and discovery, but privacy risks limit the sharing of such data on a wide scale. Various approaches have been developed to mitigate risk, including record simulation via generative adversarial networks (GANs). While showing promise in certain application domains, GANs lack a principled approach for EMR data that induces subpar simulation. In this article, we improve EMR simulation through a novel pipeline that (1) enhances the learning model, (2) incorporates evaluation criteria for data utility that informs learning, and (3) refines the training process.

MATERIALS AND METHODS
We propose a new electronic health record generator using a GAN with a Wasserstein divergence and layer normalization techniques. We designed 2 utility measures to characterize similarity in the structural properties of real and simulated EMRs in the original and latent space, respectively. We applied a filtering strategy to enhance GAN training for low-prevalence clinical concepts. We evaluated the new and existing GANs with utility and privacy measures (membership and disclosure attacks) using billing codes from over 1 million EMRs at Vanderbilt University Medical Center.

RESULTS
The proposed model outperformed the state-of-the-art approaches with significant improvement in retaining the nature of real records, including prediction performance and structural properties, without sacrificing privacy. Additionally, the filtering strategy achieved higher utility when the EMR training dataset was small.

CONCLUSIONS
These findings illustrate that EMR simulation through GANs can be substantially improved through more appropriate training, modeling, and evaluation criteria.
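
The abstract does not define its filtering strategy or its utility measures, so the following is only a rough illustration of the general ideas, not the authors' method. It assumes EMRs are encoded as a binary patient-by-billing-code matrix, that filtering means dropping codes below a prevalence threshold, and that structural similarity in the original space can be approximated by comparing code-by-code correlation matrices. The function names `filter_low_prevalence` and `correlation_gap` and the threshold parameter are hypothetical.

```python
import numpy as np


def filter_low_prevalence(records, min_prevalence):
    """Drop code columns whose prevalence falls below a threshold.

    Assumption: `records` is a binary (patients x codes) matrix. The paper's
    actual filtering rule is not given in the abstract.
    Returns the filtered matrix and the boolean mask of kept columns.
    """
    prevalence = records.mean(axis=0)      # fraction of patients with each code
    keep = prevalence >= min_prevalence
    return records[:, keep], keep


def correlation_gap(real, simulated):
    """Mean absolute difference between code-by-code correlation matrices.

    Assumption: this stands in for a structural utility measure in the
    original space; 0.0 means the two correlation structures coincide.
    """
    with np.errstate(invalid="ignore", divide="ignore"):
        real_corr = np.corrcoef(real, rowvar=False)
        sim_corr = np.corrcoef(simulated, rowvar=False)
    # Constant columns produce NaN correlations; treat them as zero.
    real_corr = np.nan_to_num(real_corr)
    sim_corr = np.nan_to_num(sim_corr)
    return float(np.mean(np.abs(real_corr - sim_corr)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data only: six hypothetical codes with decreasing prevalence.
    rates = np.array([0.5, 0.4, 0.3, 0.2, 0.01, 0.0])
    real = (rng.random((500, 6)) < rates).astype(float)
    simulated = (rng.random((500, 6)) < rates).astype(float)

    real_kept, keep = filter_low_prevalence(real, min_prevalence=0.05)
    print("codes kept:", int(keep.sum()), "of", keep.size)
    print("correlation gap:", correlation_gap(real_kept, simulated[:, keep]))
```

A smaller gap would suggest that simulated records preserve more of the pairwise structure among codes. A measure in the latent space, as the abstract mentions, would need an embedding step that is not sketched here.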
