Ensemble feature selection and tabular data augmentation with generative adversarial networks to enhance cutaneous melanoma identification and interpretability.

IF 6.1; Zone 3 (Biology); Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining; Pub Date: 2024-10-30; DOI: 10.1186/s13040-024-00397-7

Vanesa Gómez-Martínez, David Chushig-Muzo, Marit B Veierød, Conceição Granja, Cristina Soguero-Ruiz

Biodata Mining 17(1):46. Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11526724/pdf/

Citations: 0

Abstract

Background: Cutaneous melanoma is the most aggressive form of skin cancer, responsible for most skin cancer-related deaths. Recent advances in artificial intelligence, together with the availability of public dermoscopy image datasets, have made it possible to assist dermatologists in melanoma identification. While image feature extraction holds potential for melanoma detection, it often leads to high-dimensional data. Furthermore, most image datasets suffer from class imbalance, where a few classes have numerous samples while others are under-represented.

Methods: In this paper, we propose combining ensemble feature selection (FS) methods and data augmentation with conditional tabular generative adversarial networks (CTGAN) to enhance melanoma identification in imbalanced datasets. We used dermoscopy images from two public datasets, PH2 and Derm7pt, which contain melanoma and not-melanoma lesions. To capture intrinsic information from skin lesions, we applied two feature extraction (FE) approaches: handcrafted features and embedding features. For the former, color, geometric, and first-, second-, and higher-order texture features were extracted, whereas for the latter, embeddings were obtained using ResNet-based models. To alleviate the high dimensionality resulting from FE, ensemble FS with filter methods was used and evaluated. For data augmentation, we conducted a progressive analysis of the imbalance ratio (IR), which is related to the number of synthetic samples created, and evaluated its impact on the predictive results. To gain interpretability of the predictive models, we used SHAP, bootstrap resampling statistical tests, and UMAP visualizations.
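The ensemble FS step can be illustrated with a minimal sketch. The abstract does not name the filter methods or the aggregation rule, so the two filters here (a Fisher score and an absolute point-biserial correlation), the mean-rank aggregation, and all function names and toy values are assumptions chosen for illustration, not the authors' implementation.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Squared gap between class means over the summed within-class variances."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    within = pvariance(pos) + pvariance(neg)
    return (mean(pos) - mean(neg)) ** 2 / within if within > 0 else 0.0


def abs_correlation(values, labels):
    """Absolute Pearson (point-biserial) correlation of a feature with the binary label."""
    mv, ml = mean(values), mean(labels)
    cov = sum((v - mv) * (y - ml) for v, y in zip(values, labels))
    sv = sum((v - mv) ** 2 for v in values) ** 0.5
    sl = sum((y - ml) ** 2 for y in labels) ** 0.5
    return abs(cov / (sv * sl)) if sv > 0 and sl > 0 else 0.0


def ranks(scores):
    """Convert scores to ranks, where rank 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    rank = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        rank[j] = position
    return rank


def ensemble_select(columns, labels, k, filters=(fisher_score, abs_correlation)):
    """Return the indices of the k features with the lowest mean rank across filters."""
    per_filter = [ranks([f(col, labels) for col in columns]) for f in filters]
    mean_rank = [mean(r[j] for r in per_filter) for j in range(len(columns))]
    return sorted(range(len(columns)), key=lambda j: mean_rank[j])[:k]


# Toy data (hypothetical): 1 = melanoma, 0 = not-melanoma; one list per feature.
labels = [1, 1, 1, 0, 0, 0]
columns = [
    [0.9, 0.8, 0.95, 0.1, 0.2, 0.15],  # separates the classes strongly
    [0.5, 0.4, 0.6, 0.5, 0.4, 0.6],    # same distribution in both classes
    [0.7, 0.3, 0.6, 0.4, 0.5, 0.2],    # separates the classes weakly
]
selected = ensemble_select(columns, labels, k=2)
print(selected)
```

Aggregating ranks rather than raw scores keeps filters with different scales comparable, which is one common reason to prefer rank-based ensembles; other aggregation rules (for example voting on top-k sets) would fit the same description.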

Results: The combination of ensemble FS, CTGAN, and linear models yielded the best predictive results, with AUCROC values of 87% (with support vector machine and IR=0.9) and 76% (with LASSO and IR=1.0) for PH2 and Derm7pt, respectively. We also found that melanoma lesions were mainly characterized by color-related features, while not-melanoma lesions were characterized by texture features.
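To make the IR values above concrete, the following sketch computes how many synthetic minority-class samples a generator would have to produce to reach a target IR. It assumes that IR is defined as the minority count divided by the majority count after augmentation, so that IR=1.0 means a fully balanced set; the abstract does not state this definition. The class counts and the function name are hypothetical, and the generator itself (CTGAN in the paper) is not shown.

```python
def synthetic_needed(n_minority, n_majority, target_ir):
    """Number of synthetic minority samples needed to reach target_ir.

    Assumes IR = minority / majority after augmentation (an assumption,
    not a definition taken from the paper).
    """
    if not 0 < target_ir <= 1:
        raise ValueError("target_ir must be in (0, 1]")
    target_minority = round(target_ir * n_majority)
    # Never return a negative count if the data already meets the target.
    return max(0, target_minority - n_minority)


# Hypothetical class counts used only for illustration.
n_melanoma, n_not_melanoma = 40, 160

# Progressive schedule of target IRs, from the original ratio up to balance.
schedule = [0.25, 0.5, 0.75, 0.9, 1.0]
plan = {ir: synthetic_needed(n_melanoma, n_not_melanoma, ir) for ir in schedule}

for ir, n_new in plan.items():
    print(f"IR={ir}: generate {n_new} synthetic melanoma samples")
```

Under these assumptions, synthetic samples would be added only to the training split, so that evaluation metrics are computed on real lesions.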

Conclusions: Our results demonstrate the effectiveness of ensemble FS and synthetic data in developing models that accurately identify melanoma. This research advances skin lesion analysis, contributing to both melanoma detection and the interpretation of the main features used to identify it.

Source journal

Biodata Mining (MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore: 7.90

Self-citation rate: 0.00%

Articles published: not given

Review time: 23 weeks

About the journal: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:
- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development, and integration of databases, software, and web services for the storage, management, retrieval, and analysis of data from large-scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.