Evaluating Privacy and Utility in Synthetic EHR Data Generation for Adverse Drug Event Detection.
Thu Dinh, Hercules Dalianis
Studies in health technology and informatics, vol. 332, pp. 32-36. Published 2025-10-02. Journal Article.
DOI: https://doi.org/10.3233/SHTI251490
This study examines the use of the Synthetic Data Vault (SDV) tool to generate synthetic electronic health record (EHR) data for adverse drug event (ADE) detection. Experiments were conducted with three off-the-shelf synthetic data generators, GaussianCopula, Conditional Tabular Generative Adversarial Network (CTGAN), and Tabular Variational Autoencoder (TVAE), using a structured Swedish dataset. Evaluation combined SynthEval metrics with a downstream assessment in which Random Forest classifiers were trained on synthetic data and tested on real data ('train-on-synthetic, test-on-real', TSTR). Results show that TVAE's performance varied with dataset size and class balance and improved on larger datasets. GaussianCopula provided more stable utility and stronger privacy protection at the cost of fidelity. CTGAN generated realistic data but performed inconsistently under TSTR evaluation. These findings highlight the importance of selecting synthetic data models based on healthcare application needs and dataset characteristics.
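The TSTR protocol named in the abstract can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration, not the authors' pipeline: the toy two-feature table stands in for the Swedish dataset, a per-class independent Gaussian sampler stands in for the SDV generators, a nearest-centroid classifier stands in for the Random Forest, and accuracy stands in for whatever metrics the study reports. Only the shape of the comparison (train on synthetic, test on held-out real data, against a train-on-real baseline) reflects the abstract.

```python
import random
import statistics


def make_real(n, seed):
    """Toy 'real' table (hypothetical): two numeric features, label 1 = ADE."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < 0.3 else 0  # imbalanced classes
        center = 2.0 if label else -2.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows


def fit_generator(rows):
    """Stand-in generator (NOT an SDV model): per-class feature mean and std."""
    params = {}
    for label in {y for _, y in rows}:
        feats = [x for x, y in rows if y == label]
        cols = list(zip(*feats))
        params[label] = {
            "weight": len(feats) / len(rows),
            "mean": [statistics.fmean(c) for c in cols],
            "std": [statistics.stdev(c) for c in cols],
        }
    return params


def sample(params, n, seed):
    """Draw n synthetic rows from the fitted per-class Gaussians."""
    rng = random.Random(seed)
    labels = list(params)
    weights = [params[lab]["weight"] for lab in labels]
    rows = []
    for _ in range(n):
        label = rng.choices(labels, weights=weights)[0]
        p = params[label]
        rows.append(([rng.gauss(m, s) for m, s in zip(p["mean"], p["std"])], label))
    return rows


def fit_centroids(rows):
    """Nearest-centroid classifier, a simple stand-in for the Random Forest."""
    centroids = {}
    for label in {y for _, y in rows}:
        cols = zip(*[x for x, y in rows if y == label])
        centroids[label] = [statistics.fmean(c) for c in cols]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


def accuracy(centroids, rows):
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)


real_train = make_real(400, seed=0)
real_test = make_real(200, seed=1)  # held-out real data, never seen by the generator
synthetic = sample(fit_generator(real_train), 400, seed=2)

tstr_acc = accuracy(fit_centroids(synthetic), real_test)   # train-on-synthetic, test-on-real
trtr_acc = accuracy(fit_centroids(real_train), real_test)  # train-on-real baseline
print(f"TSTR accuracy: {tstr_acc:.3f}  baseline accuracy: {trtr_acc:.3f}")
```

A small gap between the two scores suggests the synthetic data preserves the signal the downstream classifier needs. On imbalanced ADE data, a class-sensitive metric such as recall or F1 would likely be more informative than accuracy.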