Jan-Niklas Eckardt, Waldemar Hahn, Arsela Prelaj, Martin Bornhäuser, Jan Moritz Middeke, Jakob Nikolas Kather
Title: Artificial intelligence-generated synthetic data for cancer research and clinical trials
Journal: Nature Reviews Cancer, volume 26, issue 5, pages 351-363
Publication date: 2026-02-20
Publication type: Journal Article
DOI: 10.1038/s41568-026-00912-4
URL: https://www.nature.com/articles/s41568-026-00912-4
Impact factor: 66.8 (JCR Q1, Oncology)
Open access: no
Citation count: 0
Abstract
Synthetic data, generated through advanced artificial intelligence models, are gaining traction in healthcare research, particularly in high-stakes fields such as haematology and oncology. By replicating statistical properties, intervariable relationships and behaviours of real-world data, synthetic data sets can serve as valuable supplements or substitutes for conventional medical data. They offer the potential to overcome barriers to data access and sharing, democratize scientific discovery, and reduce the costs and failure rates of clinical trials. However, the lack of standardization in training data selection, model evaluation, bias mitigation, privacy preservation and quality assurance remains a major challenge, limiting their reliability and safe application. In this Review, we explore the role of synthetic data in cancer research and clinical trials, present real-world examples of their use, critically examine limitations and pitfalls, and propose best practices to enhance fidelity, validity, fairness and utility. Although synthetic data are not a ‘silver bullet’ for the challenges of clinical research, with rigorous validation and oversight, they have the potential to transform data sharing, scientific collaboration and clinical trial design.

Synthetic data generated by generative artificial intelligence models can serve as a substitute for real patient data. In this Review, Eckardt et al. discuss how synthetic data sets can overcome barriers to data access and sharing, democratize scientific discovery in cancer research, and reduce the costs and failure rates of cancer clinical trials. They also discuss how this will become possible only if the lack of standardization in training data selection, model evaluation, bias mitigation, privacy preservation and quality assurance can be overcome.
About the journal:
Nature Reviews Cancer, a part of the Nature Reviews portfolio of journals, aims to be the premier source of reviews and commentaries for the scientific communities it serves. The correct abbreviation for abstracting and indexing purposes is Nat. Rev. Cancer. The international standard serial numbers (ISSN) for Nature Reviews Cancer are 1474-175X (print) and 1474-1768 (online). Unlike other journals, Nature Reviews Cancer does not have an external editorial board. Instead, all editorial decisions are made by a team of full-time professional editors who are PhD-level scientists. The journal publishes Research Highlights, Comments, Reviews, and Perspectives relevant to cancer researchers; the broad scope of these articles is intended to help them reach the widest possible audience.