Design and development of a systematic validation protocol for synthetic melanoma images for responsible use in medical artificial intelligence
Alessio Luschi, Linda Tognetti, Alessandra Cartocci, Elisa Cinotti, Giovanni Rubegni, Laura Calabrese, Martina D’onghia, Martina Dragotto, Elvira Moscarella, Gabriella Brancaccio, Giulia Briatico, Camila Scharf, Dario Buononato, Vittorio Tancredi, Carmen Cantisani, Camilla Chello, Luca Ambrosio, Pietro Scribani Rossi, Marco Virone, Giovanni Pellacani, Ernesto Iadanza
Biocybernetics and Biomedical Engineering, vol. 45, no. 4, pp. 608-616. Published 2025-09-21. Journal Article.
DOI: 10.1016/j.bbe.2025.09.001
URL: https://www.sciencedirect.com/science/article/pii/S020852162500066X
Impact factor: 6.6 · JCR: Q1 (Engineering, Biomedical)
Abstract
Malignant melanoma is the deadliest form of skin cancer, and artificial intelligence could help address its diagnostic challenges. Generative Adversarial Networks (GANs) can generate synthetic dermoscopic images to augment limited real datasets, but the lack of standardised validation protocols holds back models’ reliability and clinicians’ trust. This study aims to design and develop a systematic validation protocol combining quantitative metrics and qualitative expert assessments to evaluate the realism, fidelity, diversity, and usefulness of synthetic dermoscopic melanoma images. A StyleGAN2 model, designed and trained in a previous study, was selected for its superior quantitative performance and used to generate 25 synthetic melanoma images, which were matched with 25 real images. A panel of 17 dermoscopists assessed the images using a 7-point Likert scale across multiple qualitative attributes (real vs. synthetic, skin texture, visual realism, and confidence) and pattern analysis. Accuracy, sensitivity, specificity, Fleiss’ Kappa, and Krippendorff’s Alpha were calculated to analyse inter-rater agreement and evaluation outcomes. Accuracy in real vs. synthetic image classification was moderate (64%), with sensitivity at 73% and specificity at 56%, and inter-rater concordance over qualitative attributes was poor. Synthetic images obtained superior scores in medium visual and overall realism, and in confidence level, while the frequency of recognition of pigment network patterns was comparable to that for real images. The proposed holistic validation protocol can effectively estimate the quality level of synthetic dermoscopic images, regardless of the architecture of the model used for generation, offering an objective and reliable evaluation tool, as qualitative evaluations remain crucial to ensure their safe deployment in clinical settings.
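As an illustration of the agreement statistics named in the abstract, the following is a minimal sketch of how accuracy, sensitivity, specificity, and Fleiss' Kappa might be computed from a panel's real vs. synthetic votes. It is not the authors' code. The function names (`binary_metrics`, `fleiss_kappa`), the toy ratings, and the choice of "synthetic" as the positive class are all assumptions made for the example; the abstract does not state which class was treated as positive or how votes were pooled.

```python
from collections import Counter


def binary_metrics(truth, votes, positive="synthetic"):
    """Accuracy, sensitivity and specificity over pooled (truth, vote) pairs.

    Assumption: `positive` marks the class counted as a "hit" for sensitivity.
    """
    tp = fn = tn = fp = 0
    for t, v in zip(truth, votes):
        if t == positive:
            if v == positive:
                tp += 1
            else:
                fn += 1
        else:
            if v == positive:
                fp += 1
            else:
                tn += 1
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def fleiss_kappa(ratings):
    """Fleiss' Kappa for `ratings[i]` = labels given to item i by each rater.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(row) for row in ratings]
    categories = sorted({label for row in ratings for label in row})

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    pair_total = n_raters * (n_raters - 1)
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / pair_total
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall proportion of each category.
    p_cat = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_exp = sum(p * p for p in p_cat)

    if p_exp == 1.0:
        return float("nan")  # only one category was ever used; Kappa is undefined
    return (p_bar - p_exp) / (1.0 - p_exp)


# Toy data (hypothetical): 4 images, 3 raters each.
truth = ["synthetic", "synthetic", "real", "real"]
ratings = [
    ["synthetic", "synthetic", "synthetic"],
    ["synthetic", "real", "real"],
    ["real", "real", "real"],
    ["real", "synthetic", "real"],
]

# Pool every individual vote against the ground truth of its image.
pooled_truth = [t for t, row in zip(truth, ratings) for _ in row]
pooled_votes = [v for row in ratings for v in row]

metrics = binary_metrics(pooled_truth, pooled_votes)
kappa = fleiss_kappa(ratings)
print(metrics, kappa)
```

Pooling all individual votes is only one possible convention; per-rater metrics averaged across the panel would be an equally reasonable reading of the abstract. Krippendorff's Alpha is omitted here because it additionally depends on the chosen distance metric for the Likert-scale attributes.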
About the journal:
Biocybernetics and Biomedical Engineering is a quarterly journal, founded in 1981, devoted to publishing the results of original, innovative and creative research investigations in the field of biocybernetics and biomedical engineering, which bridges mathematical, physical, chemical and engineering methods and technology to analyse physiological processes in living organisms as well as to develop methods, devices and systems used in biology and medicine, mainly in medical diagnosis, monitoring systems and therapy. The Journal's mission is to advance scientific discovery into new or improved standards of care, and to promote a wide-ranging exchange between science and its application to humans.