Artificial intelligence-generated synthetic data for cancer research and clinical trials

IF 66.8; Zone 1 (Medicine); JCR Q1 (Oncology)

Nature Reviews Cancer Pub Date : 2026-02-20 DOI:10.1038/s41568-026-00912-4

Jan-Niklas Eckardt, Waldemar Hahn, Arsela Prelaj, Martin Bornhäuser, Jan Moritz Middeke, Jakob Nikolas Kather

Journal: Nature Reviews Cancer; volume: 26 5; pages: 351-363; publication type: Journal Article; URL: https://www.nature.com/articles/s41568-026-00912-4

Citations: 0

Abstract

Synthetic data, generated through advanced artificial intelligence models, are gaining traction in healthcare research, particularly in high-stakes fields such as haematology and oncology. By replicating statistical properties, intervariable relationships and behaviours of real-world data, synthetic data sets can serve as valuable supplements or substitutes for conventional medical data. They offer the potential to overcome barriers to data access and sharing, democratize scientific discovery, and reduce the costs and failure rates of clinical trials. However, the lack of standardization in training data selection, model evaluation, bias mitigation, privacy preservation and quality assurance remains a major challenge, limiting their reliability and safe application. In this Review, we explore the role of synthetic data in cancer research and clinical trials, present real-world examples of their use, critically examine limitations and pitfalls, and propose best practices to enhance fidelity, validity, fairness and utility. Although synthetic data are not a ‘silver bullet’ for the challenges of clinical research, with rigorous validation and oversight, they have the potential to transform data sharing, scientific collaboration and clinical trial design.

Synthetic data generated by generative artificial intelligence models can serve as a substitute for real patient data. In this Review, Eckardt et al. discuss how synthetic data sets can overcome barriers to data access and sharing, democratize scientific discovery in cancer research, and reduce the costs and failure rates of cancer clinical trials. They also discuss how this will only become possible if the lack of standardization in training data selection, model evaluation, bias mitigation, privacy preservation and quality assurance can be overcome.




Source journal

Nature Reviews Cancer (Medicine: Oncology)

CiteScore

111.90

Self-citation rate

0.40%

Number of articles published

Review time

6-12 weeks

About the journal: Nature Reviews Cancer, a part of the Nature Reviews portfolio of journals, aims to be the premier source of reviews and commentaries for the scientific communities it serves. The correct abbreviation for abstracting and indexing purposes is Nat. Rev. Cancer. The international standard serial numbers (ISSN) for Nature Reviews Cancer are 1474-175X (print) and 1474-1768 (online). Unlike other journals, Nature Reviews Cancer does not have an external editorial board. Instead, all editorial decisions are made by a team of full-time professional editors who are PhD-level scientists. The journal publishes Research Highlights, Comments, Reviews, and Perspectives relevant to cancer researchers, ensuring that the articles reach the widest possible audience due to their broad scope.