TrialSynth: Generation of Synthetic Sequential Clinical Trial Data
Chufan Gao, Mandis Beigi, Afrah Shafquat, Jacob Aptekar, Jimeng Sun
arXiv - CS - Machine Learning, published 2024-09-11, DOI: arxiv-2409.07089
Abstract
Analyzing data from past clinical trials is part of the ongoing effort to
optimize the design, implementation, and execution of new clinical trials and
more efficiently bring life-saving interventions to market. Although the
generation of static-context synthetic clinical trial data has advanced
recently, generating fine-grained synthetic time-sequential clinical trial data
remains challenging because of limited patient availability and the constraints
imposed by patient privacy needs. Given that patient trajectories over
an entire clinical trial are of high importance for optimizing trial design and
efforts to prevent harmful adverse events, there is a significant need for the
generation of high-fidelity time-sequence clinical trial data. Here we
introduce TrialSynth, a Variational Autoencoder (VAE) designed to address the
specific challenges of generating synthetic time-sequence clinical trial data.
Distinct from related clinical data VAE methods, the core of our method
leverages Hawkes Processes (HP), which are particularly well suited to the
event-type and time-gap prediction needed to capture the structure of
sequential clinical trial data. Our experiments demonstrate that TrialSynth
outperforms comparable methods that can generate sequential clinical trial
data, both in fidelity and in enabling the generation of highly accurate event
sequences, across multiple real-world sequential event datasets with small
patient source populations and minimal external information. Notably, our
empirical findings highlight that
TrialSynth not only outperforms existing clinical sequence-generating methods
but also produces data with superior utility while empirically preserving
patient privacy.
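
As a rough illustration of the Hawkes process idea mentioned in the abstract (not the TrialSynth model itself, whose architecture the abstract does not detail), the sketch below simulates a univariate Hawkes process with an exponential kernel using Ogata's thinning algorithm. The function names, the parameters `mu`, `alpha`, and `beta`, and their default values are all assumptions chosen for illustration. Each past event temporarily raises the intensity, so events cluster in time, and the gaps between consecutive events are the kind of "time gap" a sequence model would need to predict.

```python
import math
import random


def hawkes_intensity(t, history, mu=0.2, alpha=0.5, beta=1.0):
    """Conditional intensity: mu + sum over past events of alpha * exp(-beta * (t - t_i)).

    mu is the baseline rate, alpha the jump caused by each event, and beta the
    decay rate of that jump (all hypothetical illustration values).
    """
    return mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in history if ti < t)


def simulate_hawkes(horizon, mu=0.2, alpha=0.5, beta=1.0, seed=0):
    """Simulate event times on [0, horizon) with Ogata's thinning algorithm."""
    rng = random.Random(seed)
    events = []
    t = 0.0
    while True:
        # Between events the intensity only decays, so its value right now
        # (counting every event so far) is an upper bound until the next event.
        lam_bar = mu + sum(alpha * math.exp(-beta * (t - ti)) for ti in events)
        # Propose the next candidate time from a Poisson process of rate lam_bar.
        t += rng.expovariate(lam_bar)
        if t >= horizon:
            break
        # Accept the candidate with probability intensity(t) / lam_bar.
        if rng.random() * lam_bar <= hawkes_intensity(t, events, mu, alpha, beta):
            events.append(t)
    return events


if __name__ == "__main__":
    times = simulate_hawkes(horizon=50.0, seed=1)
    gaps = [b - a for a, b in zip(times, times[1:])]
    print(f"{len(times)} events; first gaps: {[round(g, 2) for g in gaps[:5]]}")
```

With `alpha / beta < 1`, as in the defaults here, the process is stationary and does not explode. A multivariate version with one intensity per event type would be needed to model event types as well as time gaps.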