RemixFormer++: A Multi-Modal Transformer Model for Precision Skin Tumor Differential Diagnosis With Memory-Efficient Attention

IEEE Transactions on Medical Imaging, Pub Date: 2024-08-09, DOI: 10.1109/TMI.2024.3441012

Jing Xu; Kai Huang; Lianzhen Zhong; Yuan Gao; Kai Sun; Wei Liu; Yanjie Zhou; Wenchao Guo; Yuan Guo; Yuanqiang Zou; Yuping Duan; Le Lu; Yu Wang; Xiang Chen; Shuang Zhao

Volume 44, Issue 1, pp. 320-337. Publication type: Journal Article. URL: https://ieeexplore.ieee.org/document/10632195/


Abstract

Accurately diagnosing malignant skin tumors at an early stage is challenging because different categories of skin tumors display ambiguous and even confusing visual characteristics. To improve diagnostic precision, all available clinical data from multiple sources, particularly clinical images, dermoscopy images, and medical history, can be considered. In line with clinical practice, we propose a novel Transformer model, named RemixFormer++, that consists of a clinical image branch, a dermoscopy image branch, and a metadata branch. Given the distinct characteristics of clinical and dermoscopy images, a specialized attention strategy is adopted for each type. Clinical images are processed through a top-down architecture that captures both localized lesion details and global contextual information. Conversely, dermoscopy images undergo bottom-up processing with two-level hierarchical encoders designed to pinpoint fine-grained structural and textural features. A dedicated metadata branch integrates non-visual information by encoding relevant patient data. Fusing features from the three branches substantially boosts disease classification accuracy. RemixFormer++ demonstrates exceptional performance on four single-modality datasets (PAD-UFES-20 and ISIC 2017/2018/2019). On the public multi-modal Derm7pt dataset, compared with the previous best method, we achieve absolute increases of 5.3% in averaged F1 and 1.2% in accuracy for the classification of five skin tumors. Furthermore, on a large-scale in-house dataset of 10,351 patients with the twelve most common skin tumors, our method obtains an overall classification accuracy of 92.6%. These promising results, on par with or better than the performance of 191 dermatologists in a comprehensive reader study, indicate the potential clinical usability of our method.
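The abstract reports both averaged F1 and accuracy, and the two can differ considerably when classes are imbalanced. The sketch below is illustrative only: it assumes "averaged F1" means the unweighted (macro) mean of per-class F1 scores, which the abstract does not state, and the labels and predictions are hypothetical toy data, not results from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1}, treating each label one-vs-rest."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (assumed meaning of 'averaged F1')."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: a classifier that always predicts the majority class.
labels = ["nevus", "melanoma"]
y_true = ["nevus"] * 8 + ["melanoma"] * 2
y_pred = ["nevus"] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred, labels):.3f}")
```

In this toy case the majority-class predictor keeps a high accuracy while its macro F1 is much lower, because the minority class contributes an F1 of zero. Under the macro-averaging assumption, this is one reason an improvement in averaged F1 can be larger than the accompanying improvement in accuracy.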

