Synthetic data in biomedicine via generative artificial intelligence

Nature Reviews Bioengineering, volume 2, issue 12, pages 991-1004. Pub Date: 2024-10-08. DOI: 10.1038/s44222-024-00245-7

Boris van Breugel, Tennison Liu, Dino Oglic, Mihaela van der Schaar

Abstract

The creation and application of data in biomedicine and healthcare often face privacy constraints, bias, distributional shifts, underrepresentation of certain groups and data scarcity. Some of these challenges may be addressed by synthetic data, which can be generated by deep generative models. In this Review, we highlight how data-driven synthetic data can be created not only to overcome privacy concerns associated with real data, but also to expand and improve real data. In particular, generative-model-based data augmentation can address data scarcity; synthetic data can improve data fairness and reduce bias by accounting for underrepresented groups; and unseen scenarios may be simulated with synthetic data. We further examine how biomedically relevant data, such as molecular, imaging and tabular data, may be created by foundation models through query-specific generation. We outline the challenges associated with ownership, publication, sharing and access of synthetic data. Importantly, we discuss approaches that can be applied to measure the quality of data generated by deep generative models to improve trust in synthetic data and the results derived from such data.

Synthetic data can be created by deep generative models to address challenges associated with real data, such as privacy issues, bias and data scarcity. This Review discusses the generation and application of synthetic data in biomedicine and bioengineering, including quality assessment and validation.
