Unconditional latent diffusion models memorize patient imaging data

Salman Ul Hassan Dar, Marvin Seyfarth, Isabelle Ayx, Theano Papavassiliu, Stefan O. Schoenberg, Robert Malte Siepmann, Fabian Christopher Laqua, Jannik Kahmann, Norbert Frey, Bettina Baeßler, Sebastian Foersch, Daniel Truhn, Jakob Nikolas Kather, Sandy Engelhardt

Nature Biomedical Engineering (published 11 August 2025). DOI: 10.1038/s41551-025-01468-8
Generative artificial intelligence models facilitate open-data sharing by proposing synthetic data as surrogates for real patient data. Despite their promise for healthcare, some of these models are susceptible to patient data memorization, in which the model generates copies of patient data rather than novel synthetic samples, enabling patient re-identification. Here we assess memorization in unconditional latent diffusion models by training them on a variety of datasets for synthetic data generation and detecting memorization with a self-supervised copy detection approach. We find a high degree of patient data memorization across all datasets: approximately 37.2% of patient data were detected as memorized and 68.7% of synthetic samples were identified as patient data copies. Latent diffusion models are more susceptible to memorization than autoencoders and generative adversarial networks, yet they outperform these non-diffusion models in synthesis quality. Data augmentation during training, smaller architectures and larger training datasets can reduce memorization, whereas overtraining can increase it. These results emphasize the importance of carefully training generative models on private medical imaging datasets and of examining the synthetic data to ensure patient privacy.
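To make the copy-detection idea concrete, the sketch below flags each synthetic sample whose nearest training neighbour exceeds a cosine-similarity threshold in an embedding space. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `embed` function (which simply flattens and normalizes) stands in for a self-supervised, e.g. contrastively trained, encoder, and the 0.95 threshold is a hypothetical choice.

```python
# Minimal sketch of embedding-based copy detection between synthetic and
# training images, in the spirit of a self-supervised copy detection approach.
# The encoder stand-in, threshold and random data are illustrative assumptions.
import numpy as np

def embed(images: np.ndarray) -> np.ndarray:
    """Placeholder for a self-supervised encoder: flatten and L2-normalize
    so that dot products below equal cosine similarities."""
    flat = images.reshape(len(images), -1).astype(np.float64)
    return flat / np.linalg.norm(flat, axis=1, keepdims=True)

def flag_copies(train_imgs: np.ndarray, synth_imgs: np.ndarray,
                threshold: float = 0.95):
    """Flag synthetic samples whose closest training image exceeds the
    similarity threshold, and report which training samples are memorized."""
    train_emb = embed(train_imgs)        # shape (N_train, D)
    synth_emb = embed(synth_imgs)        # shape (N_synth, D)
    sims = synth_emb @ train_emb.T       # cosine similarity matrix
    nearest = sims.argmax(axis=1)        # closest training image per sample
    is_copy = sims.max(axis=1) >= threshold
    memorized = set(nearest[is_copy])    # training indices flagged as copied
    return is_copy, memorized

# Demo with random arrays standing in for real images: the first five
# synthetic samples are near-duplicates of training data, the rest are novel.
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 64, 64))
synth = np.concatenate([train[:5] + 0.01 * rng.normal(size=(5, 64, 64)),
                        rng.normal(size=(95, 64, 64))])
copies, memorized = flag_copies(train, synth)
print(f"{copies.mean():.1%} of synthetic samples flagged as copies; "
      f"{len(memorized)} training samples memorized")
```

In practice the threshold would be calibrated rather than fixed, for instance against the similarity distribution between augmented views of the same patient; the random data above only demonstrates the mechanics.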
About the journal:
Nature Biomedical Engineering is an online-only monthly journal launched in January 2017. It publishes original research, reviews and commentary on applied biomedicine and health technology. The journal addresses a broad audience: life scientists developing experimental or computational systems and methods to better understand human physiology; biomedical researchers and engineers designing or optimizing therapies, assays, devices or procedures for diagnosing or treating disease; and clinicians who use research outputs to evaluate patient health or administer therapy across clinical settings and healthcare contexts.