Releasing differentially private event logs using generative models

IF 2.7 | CAS Tier 3 (Computer Science) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Frederik Wangelik, Majid Rafiei, Mahsa Pourbafrani, Wil M.P. van der Aalst
{"title":"Releasing differentially private event logs using generative models","authors":"Frederik Wangelik,&nbsp;Majid Rafiei,&nbsp;Mahsa Pourbafrani,&nbsp;Wil M.P. van der Aalst","doi":"10.1016/j.datak.2025.102450","DOIUrl":null,"url":null,"abstract":"<div><div>In recent years, the industry has been witnessing an extended usage of process mining and automated event data analysis. Consequently, there is a rising significance in addressing privacy apprehensions related to the inclusion of sensitive and private information within event data utilized by process mining algorithms. State-of-the-art research mainly focuses on providing quantifiable privacy guarantees, e.g., via differential privacy, for trace variants that are used by the main process mining techniques, e.g., process discovery. However, privacy preservation techniques designed for the release of trace variants are still insufficient to meet all the demands of industry-scale utilization. Moreover, ensuring privacy guarantees in situations characterized by a high occurrence of infrequent trace variants remains a challenging endeavor. In this paper, we introduce two novel approaches for releasing differentially private trace variants based on trained generative models. With TraVaG, we leverage <em>Generative Adversarial Networks</em> (GANs) to sample from a privatized implicit variant distribution. Our second method employs <em>Denoising Diffusion Probabilistic Models</em> that reconstruct artificial trace variants from noise via trained Markov chains. Both methods offer industry-scale benefits and elevate the degree of privacy assurances, particularly in scenarios featuring a substantial prevalence of infrequent variants. Also, they overcome the shortcomings of conventional privacy preservation techniques, such as bounding the length of variants and introducing fake variants. Experimental results on real-life event data demonstrate that our approaches surpass state-of-the-art techniques in terms of privacy guarantees and utility preservation.</div></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"159 ","pages":"Article 102450"},"PeriodicalIF":2.7000,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data & Knowledge Engineering","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0169023X2500045X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

In recent years, industry has seen increasingly widespread use of process mining and automated event data analysis. Consequently, addressing privacy concerns related to the sensitive and private information contained in the event data used by process mining algorithms has become increasingly important. State-of-the-art research mainly focuses on providing quantifiable privacy guarantees, e.g., via differential privacy, for the trace variants used by the main process mining techniques, e.g., process discovery. However, privacy preservation techniques designed for releasing trace variants still fall short of the demands of industry-scale use. Moreover, ensuring privacy guarantees in settings with a high share of infrequent trace variants remains challenging. In this paper, we introduce two novel approaches for releasing differentially private trace variants based on trained generative models. With TraVaG, we leverage Generative Adversarial Networks (GANs) to sample from a privatized implicit variant distribution. Our second method employs Denoising Diffusion Probabilistic Models (DDPMs) that reconstruct artificial trace variants from noise via trained Markov chains. Both methods offer industry-scale benefits and strengthen privacy guarantees, particularly in scenarios with a substantial prevalence of infrequent variants. They also overcome the shortcomings of conventional privacy preservation techniques, such as bounding the length of variants and introducing fake variants. Experimental results on real-life event data demonstrate that our approaches surpass state-of-the-art techniques in terms of privacy guarantees and utility preservation.
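
To make the sampling step concrete, below is a minimal sketch, assuming a PyTorch setup, of how synthetic trace variants could be drawn from a trained generator in the spirit of TraVaG. It is not the authors' implementation: the fixed-length activity encoding, the ToyGenerator architecture, the activity alphabet, and the greedy decoding are illustrative assumptions, and the differential privacy guarantee would come from how the generator is trained (e.g., with differentially private gradient descent), which is omitted here. Sampling from an already privately trained model is pure post-processing and therefore adds no further privacy cost.

import torch
import torch.nn as nn

# Hypothetical activity alphabet and encoding parameters (not from the paper).
ACTIVITIES = ["register", "check", "decide", "notify", "<pad>"]
MAX_LEN = 6        # assumed maximum variant length
LATENT_DIM = 16    # dimensionality of the Gaussian noise input


class ToyGenerator(nn.Module):
    """Toy generator mapping a noise vector to per-position activity logits."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, MAX_LEN * len(ACTIVITIES)),
        )

    def forward(self, z):
        return self.net(z).view(-1, MAX_LEN, len(ACTIVITIES))


def sample_variants(generator, n):
    """Draw n synthetic trace variants from a (trained) generator."""
    with torch.no_grad():
        z = torch.randn(n, LATENT_DIM)   # noise from the latent prior
        logits = generator(z)            # shape: (n, MAX_LEN, |ACTIVITIES|)
        idx = logits.argmax(dim=-1)      # greedy decoding per position
    variants = []
    for row in idx:
        acts = [ACTIVITIES[i] for i in row.tolist() if ACTIVITIES[i] != "<pad>"]
        variants.append(tuple(acts))
    return variants


if __name__ == "__main__":
    # Untrained weights here, for illustration only; in the paper's setting the
    # generator would first be trained under differential privacy on the real
    # trace-variant distribution.
    gen = ToyGenerator()
    for variant in sample_variants(gen, 3):
        print(variant)

A diffusion-based variant of the same idea would replace the single generator pass with the learned reverse Markov chain of a DDPM, iteratively denoising Gaussian noise into a variant encoding before decoding it to a trace variant.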
Source journal
Data & Knowledge Engineering (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 5.00
Self-citation rate: 0.00%
Articles published: 66
Review time: 6 months
Journal description: Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and interaction between these two related fields of interest. DKE reaches a world-wide audience of researchers, designers, managers and users. The major aim of the journal is to identify, investigate and analyze the underlying principles in the design and effective use of these systems.