ProcessGAN: Generating Privacy-Preserving Time-Aware Process Data with Conditional Generative Adversarial Nets.

IF 4.8 | Region 3 (Computer Science) | JCR Q1: COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Knowledge Discovery from Data Pub Date : 2024-11-01 Epub Date: 2024-11-12 DOI:10.1145/3687464

Keyi Li, Sen Yang, Travis M Sullivan, Randall S Burd, Ivan Marsic

Volume/issue: 18 (9). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12369952/pdf/

Citations: 0

Abstract

Process data constructed from event logs provides valuable insights into procedural dynamics over time. The confidential information in process data, together with the data's intricate nature, prevents the datasets from being shared and makes them challenging to collect. Consequently, research using process data and analytics in the process mining domain is limited. In this study, we introduced a synthetic process data generation task to address the shortage of sharable process data. We introduced a generative adversarial network, called ProcessGAN, to generate process data with activity sequences and corresponding timestamps. ProcessGAN consists of a transformer-based network as the generator and a time-aware self-attention network as the discriminator. It can generate privacy-preserving process data from random noise. ProcessGAN considers the duration of the process and the time intervals between activities to generate realistic activity sequences with timestamps. We evaluated ProcessGAN on five real-world datasets: two public datasets and three private datasets collected in medical domains. To evaluate the synthetic data, in addition to statistical metrics, we trained a supervised model to score the synthetic processes. We also used process mining to discover workflows for synthetic medical processes and had domain experts evaluate the clinical applicability of the synthetic workflows. ProcessGAN outperformed the existing generative models in generating complex processes with valid parallel pathways. The synthetic process data generated by ProcessGAN better represented the long-range dependencies between activities, a feature relevant to complicated medical and other processes. The timestamps generated by the ProcessGAN model showed distributions similar to those of the authentic timestamps. In addition, we trained a transformer-based network to generate synthetic contexts (e.g., patient demographics) that were associated with the synthetic processes. The synthetic contexts generated by our model outperformed those of the baseline models, with distributions similar to the authentic contexts. We conclude that ProcessGAN can generate sharable synthetic process data indistinguishable from authentic data. Our source code is available at https://github.com/raaachli/ProcessGAN.
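
The abstract notes that ProcessGAN considers the overall process duration and the time intervals between activities. As a rough illustration only (not the authors' code), the sketch below shows how those two quantities could be derived from a single trace of timestamped activities. The `Trace` class, the `time_features` function, the activity names, and the use of seconds as the time unit are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trace:
    """One process instance (hypothetical format): activities plus timestamps in seconds."""
    activities: List[str]
    timestamps: List[float]


def time_features(trace: Trace) -> Tuple[List[float], float]:
    """Return (intervals between consecutive activities, total duration)."""
    if len(trace.activities) != len(trace.timestamps):
        raise ValueError("each activity needs exactly one timestamp")
    ts = trace.timestamps
    # Reject traces whose timestamps go backwards in time.
    if any(later < earlier for earlier, later in zip(ts, ts[1:])):
        raise ValueError("timestamps must be non-decreasing")
    intervals = [later - earlier for earlier, later in zip(ts, ts[1:])]
    duration = ts[-1] - ts[0] if ts else 0.0
    return intervals, duration


# Made-up example trace; activity names are illustrative only.
example = Trace(
    activities=["triage", "exam", "imaging", "discharge"],
    timestamps=[0.0, 120.0, 300.0, 900.0],
)
intervals, duration = time_features(example)
print(intervals, duration)
```

Features like these could be computed for both authentic and synthetic traces and then compared statistically; how ProcessGAN itself encodes time is described in the paper and the linked repository, not here.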

Source journal

ACM Transactions on Knowledge Discovery from Data (COMPUTER SCIENCE, INFORMATION SYSTEMS; COMPUTER SCIENCE, SOFTWARE ENGINEERING)

CiteScore: 6.70

Self-citation rate: 5.60%

Articles published: 172

Review time: 3 months

Journal description: TKDD welcomes papers on a full range of research in the knowledge discovery and analysis of diverse forms of data. Such subjects include, but are not limited to: scalable and effective algorithms for data mining and big data analysis, mining brain networks, mining data streams, mining multi-media data, mining high-dimensional data, mining text, Web, and semi-structured data, mining spatial and temporal data, data mining for community generation, social network analysis, and graph-structured data, security and privacy issues in data mining, visual, interactive and online data mining, pre-processing and post-processing for data mining, robust and scalable statistical methods, data mining languages, foundations of data mining, KDD framework and process, and novel applications and infrastructures exploiting data mining technology including massively parallel processing and cloud computing platforms. TKDD encourages papers that explore the above subjects in the context of large distributed networks of computers, parallel or multiprocessing computers, or new data devices. TKDD also encourages papers that describe emerging data mining applications that cannot be satisfied by the current data mining technology.