Synthetic data for pharmacogenetics: enabling scalable and secure research.

Impact factor: 3.4. JCR quartile: Q2 (Health Care Sciences & Services)

JAMIA Open, 8(5), ooaf107. Publication date: 2025-10-03. eCollection date: 2025-10-01. DOI: 10.1093/jamiaopen/ooaf107

Marko Miletic, Anna Bollinger, Samuel S Allemann, Murat Sariyar

Abstract

Objective: This study evaluates the performance of 7 synthetic data generation (SDG) methods (synthpop, avatar, copula, copulagan, ctgan, tvae, and the large language model-based tabula) for supporting pharmacogenetics (PGx) research.

Materials and methods: We used PGx profiles from 142 patients with adverse drug reactions or therapeutic failures, considering 2 scenarios: (1) a high-dimensional genotype dataset (104 variables) and (2) a phenotype dataset (24 variables). Models were assessed for (1) broad utility using propensity score mean squared error ($pMSE$), (2) specific utility via weighted $F_{1}$ score in a Train-Synthetic-Test-Real framework, and (3) privacy risk as ε-identifiability.
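The two utility metrics named above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the usual definition of $pMSE$: the mean squared difference between each record's propensity score (the estimated probability that the record is synthetic) and the share of synthetic records in the pooled data. It also assumes the usual support-weighted average of per-class $F_{1}$ scores. The function names `pmse` and `weighted_f1` are hypothetical, and the propensity scores are taken as given rather than fitted here.

```python
from collections import Counter


def pmse(propensity_scores, synthetic_share):
    """Mean squared gap between propensity scores and the synthetic share.

    propensity_scores: estimated P(record is synthetic) for every record in
    the pooled real + synthetic data (assumed to come from a classifier
    fitted elsewhere).
    synthetic_share: n_synthetic / (n_real + n_synthetic).
    A value near 0 means the classifier cannot tell the two sources apart.
    """
    n = len(propensity_scores)
    return sum((p - synthetic_share) ** 2 for p in propensity_scores) / n


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support.

    In a Train-Synthetic-Test-Real setup, y_pred would come from a model
    trained on synthetic data and y_true from held-out real records.
    """
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += count * f1
    return total / len(y_true)


# Toy inputs, invented purely for illustration.
indistinguishable = pmse([0.5, 0.5, 0.5, 0.5], synthetic_share=0.5)
fully_separable = pmse([1.0, 1.0, 0.0, 0.0], synthetic_share=0.5)
score = weighted_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

With equal numbers of real and synthetic records, $pMSE$ ranges from 0 (sources indistinguishable) to 0.25 (sources perfectly separable), which is why lower values indicate higher distributional fidelity.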

Results: Copula and synthpop consistently achieved strong performance across both datasets, combining low ε-identifiability (0.25-0.35) with competitive utility. Deep learning models like tabula and tvae trained for 10 000 epochs achieved lower $pMSE$ but had higher ε-identifiability (>0.4) and limited gains in predictive performance. Specific utility was only weakly correlated with broad utility, indicating that distributional fidelity does not ensure predictive relevance. Copula and synthpop often outperformed original data in weighted $F_{1}$ scores, especially under noise or data imbalance.
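To make the ε-identifiability values reported above easier to interpret, the following Python sketch uses one common formulation, which is an assumption here because the abstract does not define the metric: the fraction of real records whose nearest synthetic record lies closer than their nearest other real record. Published formulations typically weight features (for example by inverse entropy) and require categorical genotype or phenotype values to be numerically encoded. This sketch uses plain unweighted Euclidean distance on already-numeric rows as a simplification, and the function name `epsilon_identifiability` is hypothetical.

```python
import math


def _nearest_distance(point, others):
    """Smallest Euclidean distance from point to any row in others."""
    return min(math.dist(point, other) for other in others)


def epsilon_identifiability(real, synthetic):
    """Share of real rows that sit closer to a synthetic row than to any other real row.

    real, synthetic: lists of equal-length numeric tuples.
    Simplification: unweighted Euclidean distance (an assumption, see lead-in).
    """
    hits = 0
    for i, row in enumerate(real):
        other_real = real[:i] + real[i + 1:]  # exclude the row itself
        if _nearest_distance(row, synthetic) < _nearest_distance(row, other_real):
            hits += 1
    return hits / len(real)


# Toy rows, invented purely for illustration.
real_rows = [(0, 0), (5, 5), (6, 6)]
synthetic_rows = [(0, 1), (50, 50)]
risk = epsilon_identifiability(real_rows, synthetic_rows)
```

In the toy data only the first real row has a synthetic neighbour closer than its nearest real neighbour, so one of three rows counts as identifiable. Under this reading, a value of 0.25 would mean a quarter of real records have such a close synthetic neighbour.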

Discussion: While deep learning models can achieve high distributional fidelity (low $pMSE$), they often incur elevated ε-identifiability, raising privacy concerns. Traditional methods like copula and synthpop consistently offer robust utility and lower re-identification risk, particularly for high-dimensional data. Importantly, general utility does not predict specific utility ($F_{1}$ score), emphasizing the need for multimetric evaluation.

Conclusion: No single SDG method dominated across all criteria. For privacy-sensitive PGx applications, classical methods such as copula and synthpop offer a reliable trade-off between utility and privacy, making them preferable for high-dimensional, limited-sample settings.
