Dominic Aeschbacher, Jessica Meisner, Marko Miletic, Murat Sariyar
Use and Evaluation of GANs for Synthetic Data Generation in Pharmacogenetics. Studies in Health Technology and Informatics, vol. 321, pp. 240-244. Published 2024-11-22. DOI: 10.3233/SHTI241100 (https://doi.org/10.3233/SHTI241100)
Citation count: 0
Abstract
Pharmacogenetics (PGx) explores the influence of genetic variability on drug efficacy and tolerability. Synthetic Data Generation (SDG) has emerged as a promising alternative to the labor-intensive collection of real-world PGx data, which high-quality prediction models require. This study investigates the performance of two Generative Adversarial Network (GAN) models, CTGAN and CTAB-GAN+, in generating synthetic PGx data. The benchmarking is based on utility metrics (Hellinger distance and Random Forest accuracy) and ϵ-identifiability. Results demonstrate that synthetic data generated by CTAB-GAN+ can surpass the original dataset in terms of utility. For instance, CTAB-GAN+ achieves higher Random Forest accuracy than the original data, indicating better predictive performance. These improvements suggest that synthetic data not only capture the essential patterns of the original data but also enhance model generalization and prediction capabilities, providing a more robust training ground for machine learning models. Consequently, SDG offers a promising way to address data scarcity and imbalance in pharmacogenetic research.
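
As an illustration of one of the utility metrics named in the abstract, the following is a minimal sketch of the Hellinger distance between the empirical distributions of a single categorical column in a real and a synthetic dataset. It is not the authors' implementation: the function name `hellinger_distance` and the genotype labels are hypothetical, and the abstract does not say how the study aggregates the metric across columns or handles continuous features.

```python
import math
from collections import Counter


def hellinger_distance(real_values, synthetic_values):
    """Hellinger distance between the empirical distributions of two
    categorical samples. Returns a value in [0, 1]; 0 means identical
    category frequencies, 1 means no shared categories."""
    if len(real_values) == 0 or len(synthetic_values) == 0:
        raise ValueError("both samples must be non-empty")

    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real = len(real_values)
    n_synth = len(synthetic_values)

    # Compare relative frequencies over the union of observed categories;
    # Counter returns 0 for a category missing from one sample.
    squared_sum = 0.0
    for category in set(real_counts) | set(synth_counts):
        p = real_counts[category] / n_real
        q = synth_counts[category] / n_synth
        squared_sum += (math.sqrt(p) - math.sqrt(q)) ** 2

    return math.sqrt(squared_sum / 2.0)


# Hypothetical genotype column, for illustration only (not data from the study).
real_genotypes = ["A/A", "A/G", "A/G", "G/G"]
synthetic_genotypes = ["A/A", "A/G", "G/G", "G/G"]
distance = hellinger_distance(real_genotypes, synthetic_genotypes)
```

A lower distance indicates that the synthetic column reproduces the category frequencies of the real column more closely. A per-column score like this says nothing about relationships between columns, which is one reason to pair it with a downstream measure such as classifier accuracy.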