利用不同的私有合成数据从分布式数据中进行协作学习。

IF 3.3 3区医学 Q2 MEDICAL INFORMATICS

BMC Medical Informatics and Decision Making Pub Date : 2024-06-14 DOI:10.1186/s12911-024-02563-7

Lukas Prediger, Joonas Jälkö, Antti Honkela, Samuel Kaski

{"title":"利用不同的私有合成数据从分布式数据中进行协作学习。","authors":"Lukas Prediger, Joonas Jälkö, Antti Honkela, Samuel Kaski","doi":"10.1186/s12911-024-02563-7","DOIUrl":null,"url":null,"abstract":"Background: Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population level statistics, but pooling the sensitive data sets is not possible due to privacy concerns and parties are unable to engage in centrally coordinated joint computation. We study the feasibility of combining privacy preserving synthetic data sets in place of the original data for collaborative learning on real-world health data from the UK Biobank.Methods: We perform an empirical evaluation based on an existing prospective cohort study from the literature. Multiple parties were simulated by splitting the UK Biobank cohort along assessment centers, for which we generate synthetic data using differentially private generative modelling techniques. We then apply the original study's Poisson regression analysis on the combined synthetic data sets and evaluate the effects of 1) the size of local data set, 2) the number of participating parties, and 3) local shifts in distributions, on the obtained likelihood scores.Results: We discover that parties engaging in the collaborative learning via shared synthetic data obtain more accurate estimates of the regression parameters compared to using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become up to a certain limit. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analysis for said groups.Conclusions: Based on our results we conclude that sharing of synthetic data is a viable method for enabling learning from sensitive data without violating privacy constraints even if individual data sets are small or do not represent the overall population well. Lack of access to distributed sensitive data is often a bottleneck in biomedical research, which our study shows can be alleviated with privacy-preserving collaborative learning methods.","PeriodicalId":9340,"journal":{"name":"BMC Medical Informatics and Decision Making","volume":null,"pages":null},"PeriodicalIF":3.3000,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11179391/pdf/","citationCount":"0","resultStr":"{\"title\":\"Collaborative learning from distributed data with differentially private synthetic data.\",\"authors\":\"Lukas Prediger, Joonas Jälkö, Antti Honkela, Samuel Kaski\",\"doi\":\"10.1186/s12911-024-02563-7\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Background: Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population level statistics, but pooling the sensitive data sets is not possible due to privacy concerns and parties are unable to engage in centrally coordinated joint computation. We study the feasibility of combining privacy preserving synthetic data sets in place of the original data for collaborative learning on real-world health data from the UK Biobank.Methods: We perform an empirical evaluation based on an existing prospective cohort study from the literature. Multiple parties were simulated by splitting the UK Biobank cohort along assessment centers, for which we generate synthetic data using differentially private generative modelling techniques. We then apply the original study's Poisson regression analysis on the combined synthetic data sets and evaluate the effects of 1) the size of local data set, 2) the number of participating parties, and 3) local shifts in distributions, on the obtained likelihood scores.Results: We discover that parties engaging in the collaborative learning via shared synthetic data obtain more accurate estimates of the regression parameters compared to using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become up to a certain limit. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analysis for said groups.Conclusions: Based on our results we conclude that sharing of synthetic data is a viable method for enabling learning from sensitive data without violating privacy constraints even if individual data sets are small or do not represent the overall population well. Lack of access to distributed sensitive data is often a bottleneck in biomedical research, which our study shows can be alleviated with privacy-preserving collaborative learning methods.\",\"PeriodicalId\":9340,\"journal\":{\"name\":\"BMC Medical Informatics and Decision Making\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2024-06-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11179391/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Medical Informatics and Decision Making\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1186/s12911-024-02563-7\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"MEDICAL INFORMATICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Informatics and Decision Making","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12911-024-02563-7","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}

引用次数: 0

摘要

背景：考虑这样一种情况：持有敏感数据的多方旨在合作学习人口级统计数据，但由于隐私问题，无法汇集敏感数据集，各方也无法进行集中协调的联合计算。我们研究了结合隐私保护合成数据集代替原始数据，对英国生物库的真实健康数据进行协作学习的可行性：我们根据文献中现有的前瞻性队列研究进行了实证评估。我们通过将英国生物库队列按评估中心拆分来模拟多方，并使用差异化私有生成建模技术生成合成数据。然后，我们将原始研究的泊松回归分析应用于合并的合成数据集，并评估 1）本地数据集的大小；2）参与方的数量；3）分布的局部变化对所得似然比分数的影响：我们发现，与仅使用本地数据相比，通过共享合成数据参与协作学习的各方能获得更准确的回归参数估计。这一发现也适用于小型异构数据集的困难情况。此外，参与方越多，在一定限度内，改进幅度越大，一致性越强。最后，我们发现数据共享尤其有助于数据中包含代表性不足群体的各方对这些群体进行更好的调整分析：根据我们的研究结果，我们得出结论：共享合成数据是一种可行的方法，可以在不违反隐私限制的情况下从敏感数据中进行学习，即使单个数据集很小或不能很好地代表整个群体。无法访问分布式敏感数据往往是生物医学研究的瓶颈，而我们的研究表明，保护隐私的协作学习方法可以缓解这一问题。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Collaborative learning from distributed data with differentially private synthetic data.

Background: Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population level statistics, but pooling the sensitive data sets is not possible due to privacy concerns and parties are unable to engage in centrally coordinated joint computation. We study the feasibility of combining privacy preserving synthetic data sets in place of the original data for collaborative learning on real-world health data from the UK Biobank.

Methods: We perform an empirical evaluation based on an existing prospective cohort study from the literature. Multiple parties were simulated by splitting the UK Biobank cohort along assessment centers, for which we generate synthetic data using differentially private generative modelling techniques. We then apply the original study's Poisson regression analysis on the combined synthetic data sets and evaluate the effects of 1) the size of local data set, 2) the number of participating parties, and 3) local shifts in distributions, on the obtained likelihood scores.

Results: We discover that parties engaging in the collaborative learning via shared synthetic data obtain more accurate estimates of the regression parameters compared to using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become up to a certain limit. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analysis for said groups.

Conclusions: Based on our results we conclude that sharing of synthetic data is a viable method for enabling learning from sensitive data without violating privacy constraints even if individual data sets are small or do not represent the overall population well. Lack of access to distributed sensitive data is often a bottleneck in biomedical research, which our study shows can be alleviated with privacy-preserving collaborative learning methods.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

BMC Medical Informatics and Decision Making 医学-医学：信息

CiteScore

7.20

自引率

5.70%

发文量

297

审稿时长

1 months

期刊介绍： BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.