The Case for Synthetic Images Generated by Artificial Intelligence

IF 9.9, Zone 1 (Medicine), Q1 HEMATOLOGY

American Journal of Hematology Pub Date : 2025-07-23 DOI:10.1002/ajh.70019

Bingwen Eugene Fan, Stefan Winkler

American Journal of Hematology 100(10): 1910-1911. https://onlinelibrary.wiley.com/doi/10.1002/ajh.70019

Citations: 0

Abstract

The latest deep learning models boast powerful capabilities for data generation in various modalities, most notably text and images. A recent editorial by Bucci and Parini [1] focused on the opportunities this creates for researchers acting in bad faith, who can use such tools for image falsification and scientific fraud. As authors cited in this context [2], we offer a necessary counterpoint guided by Melvin Kranzberg's first law: “Technology is neither good nor bad; nor is it neutral” [3]. There is no doubt that this new technology raises ethical challenges for users, and downsides such as the easy production of “deep fake” images and videos are clearly a worrying trend. However, we must not forget the beneficial use cases of these generative tools, whose transformative scientific applications are already advancing our field.

The development of powerful diffusion models represents a technical breakthrough in synthetic image generation. These models are trained to reverse the information loss caused by progressive noise addition, and can thereby generate images that are very similar in distribution to the dataset they were trained on. This approach can be considered part of a set of more general techniques, collectively termed data augmentation, that enhance training datasets through various transformations in order to increase the quantity and diversity of training images. Basic transformations include geometric distortions, color adjustment, noise injection, and filtering. More advanced methods based on deep neural networks have also been developed, such as style transfer, super-resolution, and in-painting [4]. Synthetic image generation is a natural next step in this process. When ethically deployed, augmentation and generation methods can significantly improve machine learning performance, robustness, and generalization.
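
As a rough illustration of the basic transformations listed above, the following Python sketch applies a flip, a rotation, a brightness change, and additive Gaussian noise to an image array, and separately shows the kind of noising step that a diffusion model is trained to reverse. It is a minimal sketch under stated assumptions, not the pipeline used in the cited work: the function names `augment` and `forward_noise`, the parameter ranges, and the random placeholder image are all hypothetical, and the noising formula is the standard one from the diffusion literature rather than anything specified in this article.

```python
import numpy as np

def augment(image, rng):
    """Apply basic augmentations to a square H x H x 3 float image with values in [0, 1]."""
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                       # geometric: horizontal flip
    k = int(rng.integers(0, 4))
    out = np.rot90(out, k=k, axes=(0, 1))           # geometric: rotate by k * 90 degrees
    out = out * rng.uniform(0.8, 1.2)               # color adjustment: global brightness gain
    out = out + rng.normal(0.0, 0.02, out.shape)    # noise injection: additive Gaussian noise
    return np.clip(out, 0.0, 1.0)

def forward_noise(x0, alpha_bar, rng):
    """One draw from the forward (noising) process: sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))  # hypothetical placeholder for a real cell image
augmented = augment(image, rng)
noisy = forward_noise(image, alpha_bar=0.5, rng=rng)
print(augmented.shape, noisy.shape)
```

With `alpha_bar` close to 1 the noised image stays close to the original, and with `alpha_bar` close to 0 it is almost pure noise. A generative model learns to undo this corruption step by step, which is what distinguishes synthetic generation from the simple transformations in `augment`.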

This is particularly vital in hematology, where three critical constraints converge: scarce protected patient data, limited rare-disease cohorts, and costly expert annotations. Here, synthetic images offer demonstrable solutions. Firstly, synthetic images of cells from bone marrow smears [5] and peripheral blood films [2, 6] enable cross-institutional collaboration without breaching patient confidentiality. Secondly, combining synthetic and real microscopic cell images enhances classification accuracy in diagnostics [7]. Lastly, augmented datasets reduce reliance on scarce annotated samples while improving model generalizability [4].
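
One way to picture the combination of synthetic and real images described above is a training set in which every sample carries an explicit provenance flag and the synthetic share is capped. The Python sketch below is purely illustrative: the `Sample` record, the `build_training_set` function, the file names, and the idea of capping the synthetic fraction are assumptions made for this example and are not taken from the cited studies.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    path: str            # hypothetical file location
    label: str           # cell class assigned by an expert or by the generator
    is_synthetic: bool   # provenance flag kept with every sample

def build_training_set(real, synthetic, max_synthetic_fraction, seed=0):
    """Combine real and synthetic samples, capping the synthetic share of the result."""
    if not 0.0 <= max_synthetic_fraction < 1.0:
        raise ValueError("max_synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # n_syn / (n_real + n_syn) <= f  implies  n_syn <= f * n_real / (1 - f)
    cap = int(max_synthetic_fraction * len(real) / (1.0 - max_synthetic_fraction))
    chosen = rng.sample(synthetic, min(cap, len(synthetic)))
    combined = list(real) + chosen
    rng.shuffle(combined)
    return combined

real = [Sample(f"real_{i}.png", "blast", False) for i in range(8)]
synthetic = [Sample(f"syn_{i}.png", "blast", True) for i in range(10)]
train = build_training_set(real, synthetic, max_synthetic_fraction=0.5)
print(sum(s.is_synthetic for s in train), "synthetic of", len(train))
```

Keeping the `is_synthetic` flag on each record means the labeling can be carried through to evaluation and reporting, so that results on real images can always be separated from results on generated ones.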

We agree that malevolent use demands governance: clear labeling, provenance documentation, and algorithmic transparency are essential. Generative models remain imperfect; ensuring synthetic images are realistic, diverse, and non-inferential requires ongoing refinement across modalities [8]. Yet safeguards are advancing: the Nature portfolio of journals states that “Editors may use software to screen images for manipulation…. Editors may request the unprocessed data files to help in manuscript evaluation during the peer review process…. (and) recommend retaining unprocessed data and metadata files after publication, ideally archiving data in perpetuity” [9]. We support tamper-proof traceability mechanisms to ensure data integrity. Cryptographic hashing within existing repositories is a necessary component and avoids the cost of energy-intensive blockchain solutions, but it is not sufficient by itself to guarantee provenance or prevent post hoc manipulation. Secure, immutable, and independently verifiable registries, such as lightweight blockchain [10] or distributed ledger technologies, can additionally provide public auditability and trusted timestamping. Efficient, low-energy blockchain systems designed for hash registration are already in use in scientific data certification, and their adoption in synthetic image workflows would bolster both transparency and trust.
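
The difference between plain hashing and a verifiable registry can be sketched in a few lines of Python. The example below uses only the standard library (`hashlib`, `json`) to fingerprint image bytes and to link registry entries into a hash chain, so that editing any earlier entry invalidates every later link. It is a simplified sketch, not a blockchain and not the system cited in [10]: the function names `append_entry` and `verify_ledger`, the entry fields, and the caller-supplied timestamps are assumptions for illustration, and a real deployment would need an independent timestamping authority and a publicly replicated copy of the chain.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def entry_digest(body: dict) -> str:
    # Serialize with sorted keys so the same content always hashes identically.
    return sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))

def append_entry(ledger, image_bytes, metadata, timestamp):
    """Register an image fingerprint and link it to the previous entry."""
    body = {
        "image_sha256": sha256_hex(image_bytes),
        "metadata": metadata,          # e.g. a label marking the image as synthetic
        "timestamp": timestamp,        # caller-supplied here, so NOT a trusted timestamp
        "prev_hash": ledger[-1]["entry_hash"] if ledger else GENESIS,
    }
    entry = dict(body, entry_hash=entry_digest(body))
    ledger.append(entry)
    return entry

def verify_ledger(ledger):
    """Return True only if every entry is intact and correctly linked."""
    prev_hash = GENESIS
    for entry in ledger:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["prev_hash"] != prev_hash or entry_digest(body) != entry["entry_hash"]:
            return False
        prev_hash = entry["entry_hash"]
    return True

ledger = []
append_entry(ledger, b"fake image bytes 1", {"synthetic": True}, "2025-01-01T00:00:00Z")
append_entry(ledger, b"fake image bytes 2", {"synthetic": True}, "2025-01-02T00:00:00Z")
print(verify_ledger(ledger))
```

A stored hash only shows that a file matches a fingerprint. Anyone who can rewrite the repository can recompute every hash, which is why the chain above gives tamper evidence only when its latest `entry_hash` is also published somewhere the repository owner cannot alter.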

In conclusion, scientists need to understand the ramifications of inconsiderate or even malevolent use of synthetically generated images. Governments, institutions, journals, and regulatory bodies should provide clear frameworks and guidelines on what is appropriate, while actively promoting ethical applications that overcome data scarcity and privacy barriers. Weak oversight of synthetic images enables fraud; rigorous governance facilitates trust and progress. Clear labeling of such images, accurate description of data provenance, publicly auditable timestamping systems, and the open sourcing of datasets and algorithms become even more important in the face of these powerful new tools. Only through judicious stewardship can we ensure these tools fulfill their potential: not as instruments of deception, but as engines of discovery.

Bingwen Eugene Fan and Stefan Winkler contributed to the creation of the manuscript.

Bingwen Eugene Fan is supported by the National Medical Research Council (NMRC) Clinician Innovator Development Award (NMRC/CIDA19May-0004) and the NMRC Research Training Fellowship (RTF24jan-0017).

The authors have nothing to report.

The authors declare no conflicts of interest.


Source journal

American Journal of Hematology (Medicine: Hematology)

CiteScore: 15.70

Self-citation rate: 3.90%

Articles published: 363

Review time: 3-6 weeks

Journal description: The American Journal of Hematology offers extensive coverage of experimental and clinical aspects of blood diseases in humans and animal models. The journal publishes original contributions in both non-malignant and malignant hematological diseases, encompassing clinical and basic studies in areas such as hemostasis, thrombosis, immunology, blood banking, and stem cell biology. Clinical translational reports highlighting innovative therapeutic approaches for the diagnosis and treatment of hematological diseases are actively encouraged. The American Journal of Hematology features regular original laboratory and clinical research articles, brief research reports, critical reviews, images in hematology, as well as letters and correspondence.