MA-SAM: Modality-agnostic SAM adaptation for 3D medical image segmentation
Cheng Chen, Juzheng Miao, Dufan Wu, Aoxiao Zhong, Zhiling Yan, Sekeun Kim, Jiang Hu, Zhengliang Liu, Lichao Sun, Xiang Li, Tianming Liu, Pheng-Ann Heng, Quanzheng Li
Medical Image Analysis, vol. 98, Article 103310, published 22 August 2024. DOI: 10.1016/j.media.2024.103310

Abstract: The Segment Anything Model (SAM), a foundation model for general image segmentation, has demonstrated impressive zero-shot performance across numerous natural image segmentation tasks. However, SAM's performance declines significantly when applied to medical images, primarily due to the substantial disparity between the natural and medical image domains. To effectively adapt SAM to medical images, it is important to incorporate critical third-dimensional information, i.e., volumetric or temporal knowledge, during fine-tuning. At the same time, we aim to harness SAM's pre-trained weights within its original 2D backbone to the fullest extent. In this paper, we introduce a modality-agnostic SAM adaptation framework, named MA-SAM, that is applicable to various volumetric and video medical data. Our method is rooted in a parameter-efficient fine-tuning strategy that updates only a small portion of weight increments while preserving the majority of SAM's pre-trained weights. By injecting a series of 3D adapters into the transformer blocks of the image encoder, our method enables the pre-trained 2D backbone to extract third-dimensional information from the input data. We comprehensively evaluate our method on five medical image segmentation tasks, using 11 public datasets across CT, MRI, and surgical video data. Remarkably, without using any prompt, our method consistently outperforms various state-of-the-art 3D approaches, surpassing nnU-Net by 0.9%, 2.6%, and 9.9% in Dice for CT multi-organ segmentation, MRI prostate segmentation, and surgical scene segmentation, respectively. Our model also demonstrates strong generalization and excels in challenging tumor segmentation when prompts are used. Our code is available at: https://github.com/cchen-cc/MA-SAM.
{"title":"Domain adaptation strategies for 3D reconstruction of the lumbar spine using real fluoroscopy data","authors":"Sascha Jecklin , Youyang Shen , Amandine Gout , Daniel Suter , Lilian Calvet , Lukas Zingg , Jennifer Straub , Nicola Alessandro Cavalcanti , Mazda Farshad , Philipp Fürnstahl , Hooman Esfandiari","doi":"10.1016/j.media.2024.103322","DOIUrl":"10.1016/j.media.2024.103322","url":null,"abstract":"<div><p>In this study, we address critical barriers hindering the widespread adoption of surgical navigation in orthopedic surgeries due to limitations such as time constraints, cost implications, radiation concerns, and integration within the surgical workflow. Recently, our work X23D showed an approach for generating 3D anatomical models of the spine from only a few intraoperative fluoroscopic images. This approach negates the need for conventional registration-based surgical navigation by creating a direct intraoperative 3D reconstruction of the anatomy. Despite these strides, the practical application of X23D has been limited by a significant domain gap between synthetic training data and real intraoperative images.</p><p>In response, we devised a novel data collection protocol to assemble a paired dataset consisting of synthetic and real fluoroscopic images captured from identical perspectives. Leveraging this unique dataset, we refined our deep learning model through transfer learning, effectively bridging the domain gap between synthetic and real X-ray data. We introduce an innovative approach combining style transfer with the curated paired dataset. This method transforms real X-ray images into the synthetic domain, enabling the <em>in-silico</em>-trained X23D model to achieve high accuracy in real-world settings.</p><p>Our results demonstrated that the refined model can rapidly generate accurate 3D reconstructions of the entire lumbar spine from as few as three intraoperative fluoroscopic shots. The enhanced model reached a sufficient accuracy, achieving an 84% F1 score, equating to the benchmark set solely by synthetic data in previous research. Moreover, with an impressive computational time of just 81.1 ms, our approach offers real-time capabilities, vital for successful integration into active surgical procedures.</p><p>By investigating optimal imaging setups and view angle dependencies, we have further validated the practicality and reliability of our system in a clinical environment. Our research represents a promising advancement in intraoperative 3D reconstruction. This innovation has the potential to enhance intraoperative surgical planning, navigation, and surgical robotics.</p></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"98 ","pages":"Article 103322"},"PeriodicalIF":10.7,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1361841524002470/pdfft?md5=c8d17bbbaa45287c29e84ed636f09188&pid=1-s2.0-S1361841524002470-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142083433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Editorial for the Special Issue on the 2022 Medical Imaging with Deep Learning Conference","authors":"Shadi Albarqouni , Christian Baumgartner , Qi Dou , Ender Konukoglu , Bjoern Menze , Archana Venkataraman","doi":"10.1016/j.media.2024.103308","DOIUrl":"10.1016/j.media.2024.103308","url":null,"abstract":"","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"98 ","pages":"Article 103308"},"PeriodicalIF":10.7,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142100754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural implicit surface reconstruction of freehand 3D ultrasound volume with geometric constraints
Hongbo Chen, Logiraj Kumaralingam, Shuhang Zhang, Sheng Song, Fayi Zhang, Haibin Zhang, Thanh-Tu Pham, Kumaradevan Punithakumar, Edmond H.M. Lou, Yuyao Zhang, Lawrence H. Le, Rui Zheng
Medical Image Analysis, vol. 98, Article 103305, published 19 August 2024. DOI: 10.1016/j.media.2024.103305

Abstract: Three-dimensional (3D) freehand ultrasound (US) is a widely used imaging modality that allows non-invasive imaging of anatomy without radiation exposure. Surface reconstruction of the US volume is vital for acquiring the accurate anatomical structures needed for modeling, registration, and visualization. However, traditional methods cannot produce high-quality surfaces because of image noise, and despite improvements in smoothness, continuity, and resolution from deep learning approaches, research on surface reconstruction in freehand 3D US is still limited. This study introduces FUNSR, a self-supervised neural implicit surface reconstruction method that learns signed distance functions (SDFs) from US volumes. In particular, FUNSR iteratively learns the SDFs by moving 3D queries sampled around volumetric point clouds toward the surface, guided by two novel geometric constraints: a sign consistency constraint and an on-surface constraint with adversarial learning. Our approach is thoroughly evaluated on four datasets to demonstrate its adaptability to various anatomical structures: a hip phantom dataset, two vascular datasets, and one publicly available prostate dataset. We also show that smooth and continuous representations greatly enhance the visual appearance of US data. Furthermore, we highlight the potential of our method to improve segmentation performance, and its robustness to noise distribution and motion perturbation.
MPGAN: Multi Pareto Generative Adversarial Network for the denoising and quantitative analysis of low-dose PET images of human brain
Yu Fu, Shunjie Dong, Yanyan Huang, Meng Niu, Chao Ni, Lequan Yu, Kuangyu Shi, Zhijun Yao, Cheng Zhuo
Medical Image Analysis, vol. 98, Article 103306, published 17 August 2024. DOI: 10.1016/j.media.2024.103306

Abstract: Positron emission tomography (PET) is widely used in medical imaging for analyzing neurological disorders and related brain diseases. Full-dose imaging ensures PET image quality but raises concerns about the potential health risks of radiation exposure. The contradiction between reducing radiation exposure and maintaining diagnostic performance can be effectively addressed by reconstructing low-dose PET (L-PET) images to the same high quality as full-dose (F-PET) images. This paper introduces the Multi Pareto Generative Adversarial Network (MPGAN) for 3D end-to-end denoising of L-PET images of the human brain. MPGAN consists of two key modules: the diffused multi-round cascade generator (G_Dmc) and the dynamic Pareto-efficient discriminator (D_Ped), which play a zero-sum game for n (n ∈ {1, 2, 3}) rounds to ensure the quality of the synthesized F-PET images. A Pareto-efficient dynamic discrimination process is introduced in D_Ped to adaptively adjust the weights of the sub-discriminators for improved discrimination output. We validated the performance of MPGAN on three datasets, including two independent datasets and one mixed dataset, and compared it with 12 recent competing models. Experimental results indicate that the proposed MPGAN provides an effective solution for 3D end-to-end denoising of L-PET images of the human brain, meeting clinical standards and achieving state-of-the-art performance on commonly used metrics.
{"title":"Rethinking masked image modelling for medical image representation","authors":"Yutong Xie , Lin Gu , Tatsuya Harada , Jianpeng Zhang , Yong Xia , Qi Wu","doi":"10.1016/j.media.2024.103304","DOIUrl":"10.1016/j.media.2024.103304","url":null,"abstract":"<div><p>Masked Image Modelling (MIM), a form of self-supervised learning, has garnered significant success in computer vision by improving image representations using unannotated data. Traditional MIMs typically employ a strategy of random sampling across the image. However, this random masking technique may not be ideally suited for medical imaging, which possesses distinct characteristics divergent from natural images. In medical imaging, particularly in pathology, disease-related features are often exceedingly sparse and localized, while the remaining regions appear normal and undifferentiated. Additionally, medical images frequently accompany reports, directly pinpointing pathological changes’ location. Inspired by this, we propose <strong>M</strong>asked m<strong>ed</strong>ical <strong>I</strong>mage <strong>M</strong>odelling (MedIM), a novel approach, to our knowledge, the first research that employs radiological reports to guide the masking and restore the informative areas of images, encouraging the network to explore the stronger semantic representations from medical images. We introduce two mutual comprehensive masking strategies, knowledge-driven masking (KDM), and sentence-driven masking (SDM). KDM uses Medical Subject Headings (MeSH) words unique to radiology reports to identify symptom clues mapped to MeSH words (<em>e.g.</em>, cardiac, edema, vascular, pulmonary) and guide the mask generation. Recognizing that radiological reports often comprise several sentences detailing varied findings, SDM integrates sentence-level information to identify key regions for masking. MedIM reconstructs images informed by this masking from the KDM and SDM modules, promoting a comprehensive and enriched medical image representation. Our extensive experiments on seven downstream tasks covering multi-label/class image classification, pneumothorax segmentation, and medical image–report analysis, demonstrate that MedIM with report-guided masking achieves competitive performance. Our method substantially outperforms ImageNet pre-training, MIM-based pre-training, and medical image–report pre-training counterparts. Codes are available at <span><span>https://github.com/YtongXie/MedIM</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"98 ","pages":"Article 103304"},"PeriodicalIF":10.7,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1361841524002299/pdfft?md5=3f1249842080ca268c74cdfa823a2939&pid=1-s2.0-S1361841524002299-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142036285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimodal representations of biomedical knowledge from limited training whole slide images and reports using deep learning","authors":"Niccolò Marini , Stefano Marchesin , Marek Wodzinski , Alessandro Caputo , Damian Podareanu , Bryan Cardenas Guevara , Svetla Boytcheva , Simona Vatrano , Filippo Fraggetta , Francesco Ciompi , Gianmaria Silvello , Henning Müller , Manfredo Atzori","doi":"10.1016/j.media.2024.103303","DOIUrl":"10.1016/j.media.2024.103303","url":null,"abstract":"<div><p>The increasing availability of biomedical data creates valuable resources for developing new deep learning algorithms to support experts, especially in domains where collecting large volumes of annotated data is not trivial. Biomedical data include several modalities containing complementary information, such as medical images and reports: images are often large and encode low-level information, while reports include a summarized high-level description of the findings identified within data and often only concerning a small part of the image. However, only a few methods allow to effectively link the visual content of images with the textual content of reports, preventing medical specialists from properly benefitting from the recent opportunities offered by deep learning models. This paper introduces a multimodal architecture creating a robust biomedical data representation encoding fine-grained text representations within image embeddings. The architecture aims to tackle data scarcity (combining supervised and self-supervised learning) and to create multimodal biomedical ontologies. The architecture is trained on over 6,000 colon whole slide Images (WSI), paired with the corresponding report, collected from two digital pathology workflows. The evaluation of the multimodal architecture involves three tasks: WSI classification (on data from pathology workflow and from public repositories), multimodal data retrieval, and linking between textual and visual concepts. Noticeably, the latter two tasks are available by architectural design without further training, showing that the multimodal architecture that can be adopted as a backbone to solve peculiar tasks. The multimodal data representation outperforms the unimodal one on the classification of colon WSIs and allows to halve the data needed to reach accurate performance, reducing the computational power required and thus the carbon footprint. The combination of images and reports exploiting self-supervised algorithms allows to mine databases without needing new annotations provided by experts, extracting new information. In particular, the multimodal visual ontology, linking semantic concepts to images, may pave the way to advancements in medicine and biomedical analysis domains, not limited to histopathology.</p></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"97 ","pages":"Article 103303"},"PeriodicalIF":10.7,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1361841524002287/pdfft?md5=73a7966410c3f9ed908cd48c6bfefa5b&pid=1-s2.0-S1361841524002287-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142000329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cascaded Multi-path Shortcut Diffusion Model for Medical Image Translation
Yinchi Zhou, Tianqi Chen, Jun Hou, Huidong Xie, Nicha C. Dvornek, S. Kevin Zhou, David L. Wilson, James S. Duncan, Chi Liu, Bo Zhou
Medical Image Analysis, vol. 98, Article 103300, published 13 August 2024. DOI: 10.1016/j.media.2024.103300

Abstract: Image-to-image translation is a vital component of medical image processing, with many uses across a wide range of imaging modalities and clinical scenarios. Previous methods include generative adversarial networks (GANs) and diffusion models (DMs), which offer realism but suffer from instability and lack uncertainty estimation. Although GANs and DMs have individually demonstrated their capability in medical image translation tasks, the potential of combining them to further improve translation performance and enable uncertainty estimation remains largely unexplored. In this work, we address these challenges by proposing the Cascade Multi-path Shortcut Diffusion Model (CMDM) for high-quality medical image translation and uncertainty estimation. To reduce the required number of iterations and ensure robust performance, our method first obtains a conditional GAN-generated prior image, which is then used for efficient reverse translation with a DM. Additionally, a multi-path shortcut diffusion strategy is employed to refine the translation results and estimate uncertainty, and a cascaded pipeline with residual averaging between cascades further enhances translation quality. We collected three medical image datasets, with two sub-tasks each, to test the generalizability of our approach. Our experiments show that CMDM can produce high-quality translations comparable to state-of-the-art methods while providing reasonable uncertainty estimates that correlate well with the translation error.
Enhancing the vision–language foundation model with key semantic knowledge-emphasized report refinement
Weijian Huang, Cheng Li, Hao Yang, Jiarun Liu, Yong Liang, Hairong Zheng, Shanshan Wang
Medical Image Analysis, vol. 97, Article 103299, published 13 August 2024. DOI: 10.1016/j.media.2024.103299

Abstract: Recently, vision–language representation learning has made remarkable advances in building medical foundation models, holding immense potential for transforming clinical research and medical care. The underlying hypothesis is that the rich knowledge embedded in radiology reports can effectively assist and guide the learning process, reducing the need for additional labels. However, these reports tend to be complex and sometimes contain redundant descriptions, making it challenging for representation learning to capture the key semantic information. This paper develops a novel iterative vision–language representation learning framework built on a key semantic knowledge-emphasized report refinement method. In particular, raw radiology reports are refined to highlight the key information according to a constructed clinical dictionary and two model-optimized knowledge-enhancement metrics. The iterative framework is designed to learn progressively, starting from a general understanding of the patient's condition based on the raw reports and gradually refining and extracting the critical information essential to fine-grained analysis tasks. The effectiveness of the proposed framework is validated on various downstream medical image analysis tasks, including disease classification, region-of-interest segmentation, and phrase grounding. Our framework surpasses seven state-of-the-art methods in both fine-tuning and zero-shot settings, demonstrating its encouraging potential for different clinical applications.
Foundation models in gastrointestinal endoscopic AI: Impact of architecture, pre-training approach and data efficiency
Tim G.W. Boers, Kiki N. Fockens, Joost A. van der Putten, Tim J.M. Jaspers, Carolus H.J. Kusters, Jelmer B. Jukema, Martijn R. Jong, Maarten R. Struyvenberg, Jeroen de Groof, Jacques J. Bergman, Peter H.N. de With, Fons van der Sommen
Medical Image Analysis, vol. 98, Article 103298, published 12 August 2024. DOI: 10.1016/j.media.2024.103298. Open access.

Abstract: Pre-training deep learning models on large datasets of natural images, such as ImageNet, has become the standard for endoscopic image analysis. This approach is generally superior to training from scratch, given the scarcity of high-quality medical imagery and labels. However, it is still unknown whether features learned on natural imagery provide an optimal starting point for downstream medical endoscopic imaging tasks. Intuitively, pre-training on imagery closer to the target domain could lead to better-suited feature representations. This study evaluates whether leveraging in-domain pre-training for gastrointestinal endoscopic image analysis offers benefits over pre-training on natural images.

To this end, we present a dataset comprising 5,014,174 gastrointestinal endoscopic images from eight medical centers (GastroNet-5M) and exploit self-supervised learning with SimCLRv2, MoCov2, and DINO to learn features relevant to in-domain downstream tasks. The learned features are compared to features learned on natural images with multiple methods and variable amounts of data and/or labels (e.g., billion-scale semi-weakly supervised learning and supervised learning on ImageNet-21k). The evaluation is performed on five downstream datasets designed for a variety of gastrointestinal tasks, for example, GIANA for angiodysplasia detection and Kvasir-SEG for polyp segmentation.

The findings indicate that self-supervised domain-specific pre-training, specifically using the DINO framework, results in better-performing models than any supervised pre-training on natural images. On the ResNet50 and Vision-Transformer-small architectures, self-supervised in-domain pre-training with DINO yields an average performance boost of 1.63% and 4.62%, respectively, on the downstream datasets, measured against the best performance achieved through pre-training on natural images within any of the evaluated frameworks.

Moreover, the in-domain pre-trained models also exhibit increased robustness to distortion perturbations (noise, contrast, blur, etc.): the in-domain pre-trained ResNet50 and Vision-Transformer-small with DINO scored on average 1.28% and 3.55% higher on the performance metrics than the best pre-trained models on natural images.

Overall, this study highlights the importance of in-domain pre-training for improving the generalizability, scalability, and performance of deep learning for medical image analysis. The GastroNet-5M pre-trained weights are made publicly available in our repository: huggingface.co/tgwboers/GastroNet-5M_Pretrained_Weights.
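For readers who want to reuse such in-domain weights, a typical downstream setup would load a checkpoint into a standard backbone and swap the head for the target task, as sketched below. The checkpoint filename and state-dict layout are assumptions; the actual files live in the linked repository.

```python
import torch
from torchvision.models import resnet50

# Minimal downstream fine-tuning sketch. The checkpoint filename and its key
# layout are assumptions; the weights must first be downloaded from the
# repository referenced above.
backbone = resnet50(weights=None)
state_dict = torch.load("gastronet5m_dino_resnet50.pth", map_location="cpu")
backbone.load_state_dict(state_dict, strict=False)   # tolerate head/key mismatches

# Swap in a task-specific head, e.g. binary polyp classification,
# then fine-tune (or freeze the backbone for linear probing).
backbone.fc = torch.nn.Linear(backbone.fc.in_features, 2)
optimizer = torch.optim.AdamW(backbone.parameters(), lr=1e-4, weight_decay=1e-4)
```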