Should we synthesize more than we need: impact of synthetic data generation for high-dimensional cross-sectional medical data.

IF 4.6, Zone 2 (Medicine), Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Journal of the American Medical Informatics Association, Pub Date: 2025-10-10, DOI: 10.1093/jamia/ocaf169

Lisa Pilgram, Samer El Kababji, Dan Liu, Khaled El Emam


Citations: 0

Abstract



Objective: In medical research and education, generative artificial intelligence/machine learning (AI/ML) models to synthesize artificial medical data can enable the sharing of high-quality data while preserving the privacy of patients. Given that such data is often high-dimensional, a relevant consideration is whether to synthesize the entire dataset when only a task-relevant subset is needed. This study evaluates how the number of variables in training impacts fidelity, utility, and privacy of the synthetic data (SD).

Material and methods: We used 12 cross-sectional medical datasets, defined a downstream task with corresponding core variables, and derived 6354 variants by adding adjunct variables to the core. SD was generated using 7 different generative models and evaluated for fidelity, downstream utility, and privacy. Mixed-effect models were used to assess the effect of adjunct variables on the respective evaluation metric, accounting for the medical dataset as a random component.
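The core-plus-adjunct design described above can be illustrated with a minimal sketch. The abstract does not say how adjunct variables were selected, so the random sampling, the function name `derive_variants`, and the column names below are all assumptions made for illustration, not the authors' procedure.

```python
import random
from typing import Dict, List, Sequence


def derive_variants(
    core: Sequence[str],
    adjunct: Sequence[str],
    sizes: Sequence[int],
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Build one training-column list per requested number of adjunct variables.

    Assumption: adjunct variables are drawn at random without replacement;
    the source abstract does not specify the selection rule.
    """
    rng = random.Random(seed)
    variants = []
    for k in sizes:
        if not 0 <= k <= len(adjunct):
            raise ValueError(f"cannot draw {k} adjunct variables from {len(adjunct)}")
        extra = rng.sample(list(adjunct), k)
        # Core variables always come first, so the downstream task is unchanged.
        variants.append({"n_adjunct": k, "columns": list(core) + extra})
    return variants


# Hypothetical column names, for illustration only.
core_vars = ["age", "sex", "outcome"]
adjunct_vars = [f"lab_{i}" for i in range(10)]

variants = derive_variants(core_vars, adjunct_vars, sizes=[0, 2, 5, 10])
for v in variants:
    print(v["n_adjunct"], len(v["columns"]))
```

Each variant would then be passed to a generative model, and the resulting synthetic data evaluated on the core variables only, so that the number of adjunct variables is the only thing that changes between runs.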

Results: Fidelity was unaffected by the number of adjunct variables in 5/7 synthetic data generation (SDG) models. Similarly, downstream utility remained stable in 6/7 (predictive task) and 5/7 (inferential task) SDG models. Where significant effects were observed, they were minimal, resulting, for example, in a 0.05 decrease in Area under the Receiver Operating Characteristic curve (AUROC) when adding 120 variables. Privacy was not impacted by the number of adjunct variables.
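To put the reported effect size in per-variable terms, one possible reading of the mixed-effect analysis is a random-intercept model with a linear fixed effect. Both of these are assumptions; the abstract only states that the medical dataset was treated as a random component. Here $y_{ij}$ is the evaluation metric for variant $i$ of dataset $j$, $k_{ij}$ is the number of adjunct variables, and $u_j$ is the dataset-level random intercept.

```latex
\begin{align}
y_{ij} &= \beta_0 + \beta_1\, k_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2),\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2) \\
\beta_1 \cdot 120 &\approx -0.05
\;\Longrightarrow\;
\beta_1 \approx -\frac{0.05}{120} \approx -4.2 \times 10^{-4}
\ \text{AUROC per adjunct variable}
\end{align}
```

Under these assumptions, each additional adjunct variable would cost roughly 0.0004 AUROC in the models where an effect was significant, which is consistent with the abstract's description of the effect as minimal.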

Discussion: Our findings show that fidelity, utility, and privacy are preserved when generating a more comprehensive medical dataset than the task-relevant subset.

Conclusion: Our findings support a cost-effective, utility-preserving, and privacy-preserving way of implementing SDG in medical research and education.


Source journal

Journal of the American Medical Informatics Association (Medicine - Computer Science: Interdisciplinary Applications)

CiteScore

14.50

Self-citation rate

7.80%

Articles published

230

Review time

3-8 weeks

About the journal: JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives, and reviews also help readers stay connected with the most important informatics developments in implementation, policy, and education.