Multimodal Image Dataset for AI-based Skin Cancer (MIDAS) Benchmarking

medRxiv - Dermatology. Pub Date: 2024-06-28. DOI: 10.1101/2024.06.27.24309562

Albert S Chiou, Jesutofunmi A Omiye, Haiwen Gui, Susan M Swetter, Justin M Ko, Brian Gastman, Joshua Arbesman, Zhou Ran Cai, Olivier Gevaert, Chris Sadee, Veronica M Rotemberg, Seung Seog Han, Philipp Tschandl, Meghan Dickman, Elizabeth Bailey, Gordon H Bae, Philip Bailin, Jennifer Boldrick, Kiana Yekrang, Peter Caroline, Jackson Hanna, Nicholas R Kurtansky, Jochen Weber, Niki A See, Michelle Phung, Marianna Gallegos, Roxana Daneshjou, Roberto Novoa


Abstract

With an estimated 3 billion people globally lacking access to dermatological care, technological solutions leveraging artificial intelligence (AI) have been proposed to improve access. Diagnostic AI algorithms, however, require high-quality datasets for development and testing, particularly datasets that enable evaluation of both unimodal and multimodal approaches. Currently, the majority of dermatology AI algorithms are built and tested on proprietary, siloed data, often from a single site and with only a single image type (i.e., clinical or dermoscopic). To address this, we developed and released the Melanoma Research Alliance Multimodal Image Dataset for AI-based Skin Cancer (MIDAS), the largest publicly available, prospectively recruited dataset of paired dermoscopic and clinical images of biopsy-proven, dermatopathology-labeled skin lesions. We explored model performance on real-world cases using four previously published state-of-the-art (SOTA) models and compared model-to-clinician diagnostic performance. We also assessed algorithm performance on clinical photographs taken at different distances from the lesion to determine the influence of distance across diagnostic categories.

We prospectively enrolled 796 patients through an IRB-approved protocol with informed consent, representing 1290 unique lesions and 3830 total images (including dermoscopic images and clinical images taken at 15-cm and 30-cm distances). Images represented the diagnostic diversity of lesions seen in general dermatology, with malignant, benign, and inflammatory lesions that included melanocytic nevi (22%; n=234), invasive cutaneous melanomas (4%; n=46), and melanoma in situ (4%; n=47). When evaluating SOTA models on the MIDAS dataset, we observed reduced performance across all models compared to their previously published metrics, indicating challenges to the generalizability of current SOTA algorithms. As a comparative baseline, the dermatologists performing the biopsies were 79% accurate with their top-1 diagnosis at differentiating malignant from benign lesions. For malignant lesions, algorithms performed better on images acquired at 15 cm than at 30 cm, while dermoscopic images yielded higher sensitivity than clinical images.

Improving our understanding of the strengths and weaknesses of AI diagnostic algorithms is critical as these tools advance towards widespread clinical deployment. While many algorithms may report high performance metrics, caution is warranted because of the potential for overfitting to localized datasets. MIDAS's robust, multimodal, and diverse dataset allows researchers to evaluate algorithms on our real-world images and better assess their generalizability.
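The abstract reports accuracy and sensitivity for the malignant-versus-benign task without defining them. The following is a minimal sketch of how such metrics are conventionally computed from binary labels. It is not the authors' evaluation code: the function name `binary_metrics`, the `BinaryMetrics` container, and the toy labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float  # true-positive rate on malignant lesions
    specificity: float  # true-negative rate on benign lesions


def binary_metrics(y_true: Sequence[bool], y_pred: Sequence[bool]) -> BinaryMetrics:
    """Compute metrics where True means 'malignant' (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)

    nan = float("nan")
    return BinaryMetrics(
        accuracy=(tp + tn) / len(y_true),
        sensitivity=tp / (tp + fn) if (tp + fn) else nan,
        specificity=tn / (tn + fp) if (tn + fp) else nan,
    )


# Toy, made-up labels (NOT data from MIDAS): biopsy result vs. model call.
biopsy_malignant = [True, True, False, False]
model_malignant = [True, False, False, False]
print(binary_metrics(biopsy_malignant, model_malignant))
```

Computing the same metrics separately for each image type (dermoscopic, 15-cm clinical, 30-cm clinical) would allow the kind of per-modality comparison the abstract describes, assuming each lesion's biopsy label is shared across its images.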
