Synthetic data for pharmacogenetics: enabling scalable and secure research
Marko Miletic, Anna Bollinger, Samuel S Allemann, Murat Sariyar
JAMIA Open, volume 8, issue 5, ooaf107. Published 2025-10-03. DOI: https://doi.org/10.1093/jamiaopen/ooaf107
Abstract
Objective: This study evaluates the performance of 7 synthetic data generation (SDG) methods (synthpop, avatar, copula, copulagan, ctgan, tvae, and the large language model-based tabula) for supporting pharmacogenetics (PGx) research.
Materials and methods: We used PGx profiles from 142 patients with adverse drug reactions or therapeutic failures, considering 2 scenarios: (1) a high-dimensional genotype dataset (104 variables) and (2) a phenotype dataset (24 variables). Models were assessed for (1) broad utility using propensity score mean squared error (pMSE), (2) specific utility via weighted F1 score in a Train-Synthetic-Test-Real framework, and (3) privacy risk as ε-identifiability.
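As a rough illustration of the broad-utility metric, pMSE is commonly defined as the mean squared deviation of each record's estimated propensity score from the synthetic share c of the pooled real-plus-synthetic data. The sketch below assumes the propensity scores have already been estimated by some classifier trained to separate real from synthetic records; the function name `pmse` and the choice of classifier are assumptions for illustration, not details taken from the study.

```python
def pmse(propensity_scores, n_synthetic):
    """Propensity score mean squared error over a pooled dataset.

    propensity_scores: estimated probability, for every pooled record
        (real and synthetic), that the record is synthetic.
    n_synthetic: number of synthetic records in the pooled dataset.
    """
    n_total = len(propensity_scores)
    if n_total == 0:
        raise ValueError("propensity_scores must not be empty")
    # c is the share of synthetic records; a classifier that cannot tell
    # real from synthetic assigns every record a score close to c.
    c = n_synthetic / n_total
    return sum((p - c) ** 2 for p in propensity_scores) / n_total


# Indistinguishable data: every score equals c, so pMSE is 0.
print(pmse([0.5, 0.5, 0.5, 0.5], n_synthetic=2))
# Perfectly separable data with balanced classes: pMSE reaches 0.25.
print(pmse([0.0, 0.0, 1.0, 1.0], n_synthetic=2))
```

Lower values indicate that the synthetic records are harder to distinguish from the real ones, which is why the metric is read as distributional fidelity.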
Results: Copula and synthpop consistently achieved strong performance across both datasets, combining low ε-identifiability (0.25-0.35) with competitive utility. Deep learning models like tabula and tvae trained for 10 000 epochs achieved lower pMSE but had higher ε-identifiability (>0.4) and limited gains in predictive performance. Specific utility was only weakly correlated with broad utility, indicating that distributional fidelity does not ensure predictive relevance. Copula and synthpop often outperformed original data in weighted F1 scores, especially under noise or data imbalance.
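To make the privacy figures above easier to interpret, here is a minimal sketch of one commonly used formulation of ε-identifiability: the fraction of real records whose nearest synthetic record is closer than their nearest other real record. The abstract does not spell out the exact definition used, so this is an assumption; in particular, the plain unweighted Euclidean distance is a simplification, and published formulations may weight features differently. The function name and the toy records are hypothetical.

```python
import math


def epsilon_identifiability(real, synthetic):
    """Share of real records that lie closer to a synthetic record
    than to any other real record (unweighted Euclidean distance)."""
    if len(real) < 2 or not synthetic:
        raise ValueError("need at least two real records and one synthetic record")
    at_risk = 0
    for i, record in enumerate(real):
        # Distance to the nearest *other* real record.
        nearest_real = min(
            math.dist(record, other) for j, other in enumerate(real) if j != i
        )
        # Distance to the nearest synthetic record.
        nearest_synth = min(math.dist(record, s) for s in synthetic)
        if nearest_synth < nearest_real:
            at_risk += 1
    return at_risk / len(real)


# Toy numeric records (hypothetical): only the first real record has a
# synthetic neighbour closer than its nearest real neighbour.
real = [(0.0,), (1.0,), (2.0,), (3.0,)]
synthetic = [(-0.1,), (10.0,)]
print(epsilon_identifiability(real, synthetic))  # 0.25
```

Under this reading, a value of 0.25 means that a quarter of the real records have a synthetic neighbour closer than any other real record.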
Discussion: While deep learning models can achieve high distributional fidelity (pMSE), they often incur elevated ε-identifiability, raising privacy concerns. Traditional methods like copula and synthpop consistently offer robust utility and lower re-identification risk, particularly for high-dimensional data. Importantly, general utility does not predict specific utility (F1 score), emphasizing the need for multimetric evaluation.
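The specific-utility measure can be sketched in the same spirit. In a Train-Synthetic-Test-Real setup, a classifier is fitted on synthetic records only and scored on real records; the weighted F1 score averages per-class F1 values, weighting each class by its share of the real labels. The 1-nearest-neighbour classifier, the function names, and the toy data below are assumptions chosen to keep the example dependency-free; the abstract does not state which classifiers the study used.

```python
import math
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores, weighted by class support in y_true."""
    total = len(y_true)
    score = 0.0
    for label, support in Counter(y_true).items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = support - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (support / total) * f1
    return score


def one_nn_predict(train_x, train_y, test_x):
    """Label each test record with the label of its nearest training record."""
    predictions = []
    for x in test_x:
        nearest = min(range(len(train_x)), key=lambda i: math.dist(x, train_x[i]))
        predictions.append(train_y[nearest])
    return predictions


def tstr_weighted_f1(synth_x, synth_y, real_x, real_y):
    """Train on synthetic data, test on real data, report weighted F1."""
    return weighted_f1(real_y, one_nn_predict(synth_x, synth_y, real_x))


# Hypothetical toy data: two classes, two numeric features.
synth_x = [(0.0, 0.0), (1.0, 1.0)]
synth_y = ["a", "b"]
real_x = [(0.1, 0.0), (0.9, 1.0), (0.2, 0.1)]
real_y = ["a", "b", "a"]
print(tstr_weighted_f1(synth_x, synth_y, real_x, real_y))
```

Because the score depends on how well class boundaries survive synthesis rather than on overall distributional similarity, it can diverge from pMSE, which is consistent with the weak correlation reported above.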
Conclusion: No single SDG method dominated across all criteria. For privacy-sensitive PGx applications, classical methods such as copula and synthpop offer a reliable trade-off between utility and privacy, making them preferable for high-dimensional, limited-sample settings.