Synthetic data for pharmacogenetics: enabling scalable and secure research.

Impact factor: 3.4 (JCR Q2, Health Care Sciences & Services)
JAMIA Open. Publication date: 2025-10-03. eCollection date: 2025-10-01. DOI: 10.1093/jamiaopen/ooaf107
Marko Miletic, Anna Bollinger, Samuel S Allemann, Murat Sariyar
JAMIA Open, volume 8, issue 5, article ooaf107. PMC PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12492482/pdf/
Citations: 0

Abstract

Objective: This study evaluates the performance of 7 synthetic data generation (SDG) methods (synthpop, avatar, copula, copulagan, ctgan, tvae, and the large language model-based tabula) for supporting pharmacogenetics (PGx) research.

Materials and methods: We used PGx profiles from 142 patients with adverse drug reactions or therapeutic failures, considering 2 scenarios: (1) a high-dimensional genotype dataset (104 variables) and (2) a phenotype dataset (24 variables). Models were assessed for (1) broad utility using propensity score mean squared error (pMSE), (2) specific utility via weighted F1 score in a Train-Synthetic-Test-Real framework, and (3) privacy risk as ε-identifiability.
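
As a rough illustration of the broad-utility metric, pMSE is usually computed by pooling real and synthetic records, fitting a classifier to tell them apart, and averaging the squared deviation of each predicted propensity score from the synthetic share of the pool. The sketch below assumes the propensity scores have already been produced by some classifier; the function name `pmse` and the toy scores are hypothetical and are not taken from the study.

```python
def pmse(propensity_scores, n_synthetic):
    """Propensity score mean squared error (pMSE).

    propensity_scores: predicted probability that each pooled record
        (real and synthetic together) is synthetic, from any classifier.
    n_synthetic: number of synthetic records in the pooled data.
    """
    n_total = len(propensity_scores)
    if n_total == 0:
        raise ValueError("need at least one propensity score")
    # Expected score when real and synthetic records are indistinguishable.
    c = n_synthetic / n_total
    return sum((p - c) ** 2 for p in propensity_scores) / n_total


# Hypothetical scores for a pool of 4 real + 4 synthetic records.
indistinguishable = [0.5] * 8           # classifier cannot separate the two
separable = [0.0] * 4 + [1.0] * 4       # classifier separates them perfectly
print(pmse(indistinguishable, n_synthetic=4))  # 0.0
print(pmse(separable, n_synthetic=4))          # 0.25
```

Lower values indicate that the synthetic data are harder to distinguish from the real data; with equal numbers of real and synthetic records the value ranges from 0 to 0.25.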

Results: Copula and synthpop consistently achieved strong performance across both datasets, combining low ε-identifiability (0.25-0.35) with competitive utility. Deep learning models like tabula and tvae trained for 10 000 epochs achieved lower pMSE but had higher ε-identifiability (>0.4) and limited gains in predictive performance. Specific utility was only weakly correlated with broad utility, indicating that distributional fidelity does not ensure predictive relevance. Copula and synthpop often outperformed original data in weighted F1 scores, especially under noise or data imbalance.
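
To give a sense of what an ε-identifiability value such as 0.25 means, the sketch below counts the share of real records whose nearest synthetic record lies closer than their nearest other real record. This is a simplified assumption: formulations in the literature typically weight features (for example by inverse entropy), and the abstract does not state the exact variant used. The function names and the toy data are hypothetical.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identifiability(real, synthetic):
    """Share of real records whose nearest synthetic record is closer
    than their nearest *other* real record (unweighted simplification)."""
    if len(real) < 2 or not synthetic:
        raise ValueError("need at least 2 real records and 1 synthetic record")
    at_risk = 0
    for i, record in enumerate(real):
        d_real = min(euclidean(record, other)
                     for j, other in enumerate(real) if j != i)
        d_syn = min(euclidean(record, s) for s in synthetic)
        if d_syn < d_real:
            at_risk += 1
    return at_risk / len(real)


# Hypothetical toy data: only the first real record has a synthetic
# neighbour that is closer than its nearest real neighbour.
real = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
synthetic = [(0.0, 0.2), (3.0, 3.0)]
print(identifiability(real, synthetic))  # 0.25
```

Under this reading, a value of 0.25 means that one in four real records has a synthetic record sitting closer to it than any other real record does.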

Discussion: While deep learning models can achieve high distributional fidelity (pMSE), they often incur elevated ε-identifiability, raising privacy concerns. Traditional methods like copula and synthpop consistently offer robust utility and lower re-identification risk, particularly for high-dimensional data. Importantly, general utility does not predict specific utility (F1 score), emphasizing the need for multimetric evaluation.
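
The specific-utility measurement can be sketched as follows: a classifier is trained on synthetic rows only and scored on real rows, with per-class F1 averaged by each class's share of the real labels. The 1-nearest-neighbour classifier, the feature values, and the "normal"/"poor" labels below are hypothetical stand-ins chosen to keep the example dependency-free; the abstract does not say which classifier or target the study used.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support in y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        # tp + fn == n > 0 for every label in support, so the denominator is positive.
        total += n * (2 * tp / (2 * tp + fp + fn))
    return total / len(y_true)


def predict_1nn(train_x, train_y, x):
    """Label of the training row nearest to x (squared Euclidean distance)."""
    nearest = min(range(len(train_x)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[nearest]


# Train-Synthetic-Test-Real on hypothetical toy data.
syn_x = [(0.0, 0.1), (0.2, 0.0), (1.0, 0.9), (0.9, 1.1)]   # synthetic training rows
syn_y = ["normal", "normal", "poor", "poor"]
real_x = [(0.1, 0.1), (0.0, 0.3), (1.0, 1.0), (0.4, 0.4)]  # real test rows
real_y = ["normal", "normal", "poor", "poor"]

pred = [predict_1nn(syn_x, syn_y, x) for x in real_x]
print(pred)                       # ['normal', 'normal', 'poor', 'normal']
print(weighted_f1(real_y, pred))  # about 0.733
```

Because the score depends on how well the synthetic data preserve the relationship between predictors and one particular target, it can diverge from pMSE, which only reflects overall distributional similarity.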

Conclusion: No single SDG method dominated across all criteria. For privacy-sensitive PGx applications, classical methods such as copula and synthpop offer a reliable trade-off between utility and privacy, making them preferable for high-dimensional, limited-sample settings.

Source journal
JAMIA Open (Medicine: Health Informatics)
CiteScore: 4.10
Self-citation rate: 4.80%
Articles published: 102
Review time: 16 weeks