Design and development of a systematic validation protocol for synthetic melanoma images for responsible use in medical artificial intelligence

IF 6.6 · Zone 2 (Medicine) · Q1 ENGINEERING, BIOMEDICAL

Biocybernetics and Biomedical Engineering · Pub Date: 2025-09-21 · DOI: 10.1016/j.bbe.2025.09.001

Alessio Luschi, Linda Tognetti, Alessandra Cartocci, Elisa Cinotti, Giovanni Rubegni, Laura Calabrese, Martina D’onghia, Martina Dragotto, Elvira Moscarella, Gabriella Brancaccio, Giulia Briatico, Camila Scharf, Dario Buononato, Vittorio Tancredi, Carmen Cantisani, Camilla Chello, Luca Ambrosio, Pietro Scribani Rossi, Marco Virone, Giovanni Pellacani, Ernesto Iadanza

Volume 45, Issue 4, Pages 608-616. Full text: https://www.sciencedirect.com/science/article/pii/S020852162500066X


Abstract

Malignant melanoma is the deadliest form of skin cancer, and artificial intelligence could help address its diagnostic challenges. Generative Adversarial Networks (GANs) can generate synthetic dermoscopic images to augment limited real datasets, but the lack of standardised validation protocols limits both the reliability of models and clinicians’ trust in them. This study aims to design and develop a systematic validation protocol that combines quantitative metrics and qualitative expert assessments to evaluate the realism, fidelity, diversity, and usefulness of synthetic dermoscopic melanoma images. A StyleGAN2 model, designed and trained in a previous study, was selected for its superior quantitative performance and used to generate 25 synthetic melanoma images, which were matched with 25 real images. A panel of 17 dermoscopists assessed the images on a 7-point Likert scale across multiple qualitative attributes (real vs. synthetic, skin texture, visual realism, and confidence) and through pattern analysis. Accuracy, sensitivity, specificity, Fleiss’ Kappa, and Krippendorff’s Alpha were calculated to analyse evaluation outcomes and inter-rater agreement. Accuracy in classifying real vs. synthetic images was moderate (64%), with a sensitivity of 73% and a specificity of 56%, and inter-rater concordance on the qualitative attributes was poor. Synthetic images obtained superior scores in medium visual and overall realism and in confidence level, while pigment network patterns were recognised with a frequency comparable to that in real images. The proposed holistic validation protocol can effectively estimate the quality of synthetic dermoscopic images regardless of the architecture of the generating model. It offers an objective and reliable evaluation tool, and qualitative evaluations remain crucial to ensuring the safe deployment of such images in clinical settings.
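The agreement statistics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `binary_metrics` and `fleiss_kappa`, the label strings, and the toy ratings are all hypothetical, and treating "synthetic" as the positive class is an assumption, since the abstract does not state which class sensitivity refers to. Krippendorff's Alpha is omitted for brevity.

```python
from typing import Dict, Sequence


def binary_metrics(truth: Sequence[str], predicted: Sequence[str],
                   positive: str) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity for a two-class rating task.

    `positive` names the class counted as a "hit" (an assumption here;
    the abstract does not say which class is positive).
    """
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in the ground truth")
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
    }


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = raters who put item i in category j.

    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])
    total = n_items * n_raters
    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    # Observed agreement per item, then averaged over items.
    p_item = [(sum(c * c for c in row) - n_raters)
              / (n_raters * (n_raters - 1)) for row in counts]
    p_obs = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)    # agreement expected by chance
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_obs - p_exp) / (1.0 - p_exp)


if __name__ == "__main__":
    # Toy data for illustration only; these are not the study's ratings.
    truth = ["synthetic", "synthetic", "real", "real"]
    votes = ["synthetic", "real", "real", "synthetic"]
    print(binary_metrics(truth, votes, positive="synthetic"))
    # Three raters, two categories (real, synthetic), two images.
    print(fleiss_kappa([[2, 1], [1, 2]]))
```

In a setting like the one described, `counts` would have one row per image and one column per answer category, with each row summing to the number of raters.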


Source journal

Biocybernetics and Biomedical Engineering (ENGINEERING, BIOMEDICAL)

CiteScore: 16.50

Self-citation rate: 6.20%

Review time: 38 days

About the journal: Biocybernetics and Biomedical Engineering is a quarterly journal, founded in 1981, devoted to publishing the results of original, innovative and creative research in the field of biocybernetics and biomedical engineering. The field bridges mathematical, physical, chemical and engineering methods and technology to analyse physiological processes in living organisms and to develop methods, devices and systems used in biology and medicine, mainly in medical diagnosis, monitoring systems and therapy. The Journal's mission is to advance scientific discovery into new or improved standards of care, and to promote a wide-ranging exchange between science and its application to humans.