Should we synthesize more than we need: impact of synthetic data generation for high-dimensional cross-sectional medical data.

Impact factor 4.6 | Zone 2, Medicine | JCR Q1, Computer Science, Information Systems
Lisa Pilgram, Samer El Kababji, Dan Liu, Khaled El Emam
Journal of the American Medical Informatics Association | Journal Article | Published 2025-10-10 | DOI: 10.1093/jamia/ocaf169 | https://doi.org/10.1093/jamia/ocaf169
Citations: 0

Abstract

Objective: In medical research and education, generative artificial intelligence/machine learning (AI/ML) models to synthesize artificial medical data can enable the sharing of high-quality data while preserving the privacy of patients. Given that such data is often high-dimensional, a relevant consideration is whether to synthesize the entire dataset when only a task-relevant subset is needed. This study evaluates how the number of variables in training impacts fidelity, utility, and privacy of the synthetic data (SD).

Material and methods: We used 12 cross-sectional medical datasets, defined a downstream task with corresponding core variables, and derived 6354 variants by adding adjunct variables to the core. SD was generated using 7 different generative models and evaluated for fidelity, downstream utility, and privacy. Mixed-effect models were used to assess the effect of adjunct variables on the respective evaluation metric, accounting for the medical dataset as a random component.
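The variant construction described above (a fixed core plus a growing number of adjunct variables) can be sketched as follows. This is only an illustration: the abstract does not say how adjunct variables were chosen, so the random sampling, the function name `derive_variants`, and all column names here are assumptions rather than the authors' procedure.

```python
import random


def derive_variants(core, adjunct_pool, sizes, seed=0):
    """Build variable lists made of the core plus k sampled adjunct variables.

    Random sampling without replacement is an assumption for illustration;
    the source does not describe the selection rule.
    """
    rng = random.Random(seed)
    variants = []
    for k in sizes:
        if k > len(adjunct_pool):
            raise ValueError(f"cannot draw {k} adjunct variables from {len(adjunct_pool)}")
        extra = rng.sample(adjunct_pool, k)  # k distinct adjunct variables
        variants.append(list(core) + extra)  # core variables always come first
    return variants


# Hypothetical column names, not taken from any of the 12 datasets.
core = ["age", "sex", "outcome"]
adjunct_pool = [f"adj_{i}" for i in range(10)]

variants = derive_variants(core, adjunct_pool, sizes=[0, 2, 5, 10])
for v in variants:
    print(len(v) - len(core), "adjunct variables ->", len(v), "columns in total")
```

Each resulting variable list would then be used to subset the real dataset before training a generative model, so that the core variables are present in every variant and only the number of adjunct variables changes.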

Results: Fidelity was unaffected by the number of adjunct variables in 5 of the 7 synthetic data generation (SDG) models. Similarly, downstream utility remained stable in 6/7 (predictive task) and 5/7 (inferential task) SDG models. Where significant effects were observed, they were minimal, resulting, for example, in a 0.05 decrease in Area under the Receiver Operating Characteristic curve (AUROC) when adding 120 variables. Privacy was not impacted by the number of adjunct variables.
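To put the reported effect size in perspective, one possible reading of the mixed-effect analysis is a random-intercept model with the dataset as the grouping factor. The abstract does not give the exact specification, so the linear form and every symbol below are assumptions made for illustration only.

```latex
% Assumed random-intercept model (not stated in the source):
% y_{dv}: evaluation metric (e.g. AUROC) for variant v of dataset d
% k_{dv}: number of adjunct variables in that variant
% u_d:    random intercept for dataset d
y_{dv} = \beta_0 + \beta_1 k_{dv} + u_d + \varepsilon_{dv},
\qquad u_d \sim \mathcal{N}(0, \sigma_u^2),
\quad \varepsilon_{dv} \sim \mathcal{N}(0, \sigma^2)

% Under this linear assumption, the reported example
% (a 0.05 AUROC decrease when adding 120 variables) corresponds to
\beta_1 \approx \frac{-0.05}{120} \approx -4.2 \times 10^{-4}
\ \text{AUROC per added variable}
```

Read this way, even the models with a statistically significant effect lose less than one thousandth of an AUROC point per additional variable.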

Discussion: Our findings show that fidelity, utility, and privacy are preserved when generating a more comprehensive medical dataset than the task-relevant subset.

Conclusion: Our findings support a cost-effective, utility-preserving, and privacy-preserving way of implementing SDG in medical research and education.

Source journal
Journal of the American Medical Informatics Association (category: Medicine, Computer Science: Interdisciplinary Applications)
CiteScore: 14.50
Self-citation rate: 7.80%
Publication volume: 230
Review time: 3-8 weeks
About the journal: JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives, and reviews also help readers stay connected with the most important informatics developments in implementation, policy, and education.