Foundation models in gastrointestinal endoscopic AI: Impact of architecture, pre-training approach and data efficiency

IF 10.7 | CAS Tier 1 (Medicine) | JCR Q1, Computer Science, Artificial Intelligence
Tim G.W. Boers, Kiki N. Fockens, Joost A. van der Putten, Tim J.M. Jaspers, Carolus H.J. Kusters, Jelmer B. Jukema, Martijn R. Jong, Maarten R. Struyvenberg, Jeroen de Groof, Jacques J. Bergman, Peter H.N. de With, Fons van der Sommen
{"title":"胃肠道内窥镜人工智能中的基础模型:架构、预训练方法和数据效率的影响","authors":"Tim G.W. Boers ,&nbsp;Kiki N. Fockens ,&nbsp;Joost A. van der Putten ,&nbsp;Tim J.M. Jaspers ,&nbsp;Carolus H.J. Kusters ,&nbsp;Jelmer B. Jukema ,&nbsp;Martijn R. Jong ,&nbsp;Maarten R. Struyvenberg ,&nbsp;Jeroen de Groof ,&nbsp;Jacques J. Bergman ,&nbsp;Peter H.N. de With ,&nbsp;Fons van der Sommen","doi":"10.1016/j.media.2024.103298","DOIUrl":null,"url":null,"abstract":"<div><p>Pre-training deep learning models with large data sets of natural images, such as ImageNet, has become the standard for endoscopic image analysis. This approach is generally superior to <em>training from scratch</em>, due to the scarcity of high-quality medical imagery and labels. However, it is still unknown whether the learned features on natural imagery provide an optimal starting point for the downstream medical endoscopic imaging tasks. Intuitively, pre-training with imagery closer to the target domain could lead to better-suited feature representations. This study evaluates whether leveraging in-domain pre-training in gastrointestinal endoscopic image analysis has potential benefits compared to pre-training on natural images.</p><p>To this end, we present a dataset comprising of 5,014,174 gastrointestinal endoscopic images from eight different medical centers (GastroNet-5M), and exploit self-supervised learning with SimCLRv2, MoCov2 and DINO to learn relevant features for in-domain downstream tasks. The learned features are compared to features learned on natural images derived with multiple methods, and variable amounts of data and/or labels (e.g. Billion-scale semi-weakly supervised learning and supervised learning on ImageNet-21k). The effects of the evaluation is performed on five downstream data sets, particularly designed for a variety of gastrointestinal tasks, for example, GIANA for angiodyplsia detection and Kvasir-SEG for polyp segmentation.</p><p>The findings indicate that self-supervised domain-specific pre-training, specifically using the DINO framework, results into better performing models compared to any supervised pre-training on natural images. On the ResNet50 and Vision-Transformer-small architectures, utilizing self-supervised in-domain pre-training with DINO leads to an average performance boost of 1.63% and 4.62%, respectively, on the downstream datasets. This improvement is measured against the best performance achieved through pre-training on natural images within any of the evaluated frameworks.</p><p>Moreover, the in-domain pre-trained models also exhibit increased robustness against distortion perturbations (noise, contrast, blur, etc.), where the in-domain pre-trained ResNet50 and Vision-Transformer-small with DINO achieved on average 1.28% and 3.55% higher on the performance metrics, compared to the best performance found for pre-trained models on natural images.</p><p>Overall, this study highlights the importance of in-domain pre-training for improving the generic nature, scalability and performance of deep learning for medical image analysis. 
The GastroNet-5M pre-trained weights are made publicly available in our repository: <span><span>huggingface.co/tgwboers/GastroNet-5M_Pretrained_Weights</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"98 ","pages":"Article 103298"},"PeriodicalIF":10.7000,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1361841524002238/pdfft?md5=25ff2f1e7dfbb3491c0a72c80dc8e023&pid=1-s2.0-S1361841524002238-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Foundation models in gastrointestinal endoscopic AI: Impact of architecture, pre-training approach and data efficiency\",\"authors\":\"Tim G.W. Boers ,&nbsp;Kiki N. Fockens ,&nbsp;Joost A. van der Putten ,&nbsp;Tim J.M. Jaspers ,&nbsp;Carolus H.J. Kusters ,&nbsp;Jelmer B. Jukema ,&nbsp;Martijn R. Jong ,&nbsp;Maarten R. Struyvenberg ,&nbsp;Jeroen de Groof ,&nbsp;Jacques J. Bergman ,&nbsp;Peter H.N. de With ,&nbsp;Fons van der Sommen\",\"doi\":\"10.1016/j.media.2024.103298\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Pre-training deep learning models with large data sets of natural images, such as ImageNet, has become the standard for endoscopic image analysis. This approach is generally superior to <em>training from scratch</em>, due to the scarcity of high-quality medical imagery and labels. However, it is still unknown whether the learned features on natural imagery provide an optimal starting point for the downstream medical endoscopic imaging tasks. Intuitively, pre-training with imagery closer to the target domain could lead to better-suited feature representations. This study evaluates whether leveraging in-domain pre-training in gastrointestinal endoscopic image analysis has potential benefits compared to pre-training on natural images.</p><p>To this end, we present a dataset comprising of 5,014,174 gastrointestinal endoscopic images from eight different medical centers (GastroNet-5M), and exploit self-supervised learning with SimCLRv2, MoCov2 and DINO to learn relevant features for in-domain downstream tasks. The learned features are compared to features learned on natural images derived with multiple methods, and variable amounts of data and/or labels (e.g. Billion-scale semi-weakly supervised learning and supervised learning on ImageNet-21k). The effects of the evaluation is performed on five downstream data sets, particularly designed for a variety of gastrointestinal tasks, for example, GIANA for angiodyplsia detection and Kvasir-SEG for polyp segmentation.</p><p>The findings indicate that self-supervised domain-specific pre-training, specifically using the DINO framework, results into better performing models compared to any supervised pre-training on natural images. On the ResNet50 and Vision-Transformer-small architectures, utilizing self-supervised in-domain pre-training with DINO leads to an average performance boost of 1.63% and 4.62%, respectively, on the downstream datasets. 
This improvement is measured against the best performance achieved through pre-training on natural images within any of the evaluated frameworks.</p><p>Moreover, the in-domain pre-trained models also exhibit increased robustness against distortion perturbations (noise, contrast, blur, etc.), where the in-domain pre-trained ResNet50 and Vision-Transformer-small with DINO achieved on average 1.28% and 3.55% higher on the performance metrics, compared to the best performance found for pre-trained models on natural images.</p><p>Overall, this study highlights the importance of in-domain pre-training for improving the generic nature, scalability and performance of deep learning for medical image analysis. The GastroNet-5M pre-trained weights are made publicly available in our repository: <span><span>huggingface.co/tgwboers/GastroNet-5M_Pretrained_Weights</span><svg><path></path></svg></span>.</p></div>\",\"PeriodicalId\":18328,\"journal\":{\"name\":\"Medical image analysis\",\"volume\":\"98 \",\"pages\":\"Article 103298\"},\"PeriodicalIF\":10.7000,\"publicationDate\":\"2024-08-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S1361841524002238/pdfft?md5=25ff2f1e7dfbb3491c0a72c80dc8e023&pid=1-s2.0-S1361841524002238-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Medical image analysis\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1361841524002238\",\"RegionNum\":1,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical image analysis","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1361841524002238","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract


Pre-training deep learning models with large data sets of natural images, such as ImageNet, has become the standard for endoscopic image analysis. This approach is generally superior to training from scratch, due to the scarcity of high-quality medical imagery and labels. However, it is still unknown whether features learned on natural imagery provide an optimal starting point for downstream medical endoscopic imaging tasks. Intuitively, pre-training with imagery closer to the target domain could lead to better-suited feature representations. This study evaluates whether in-domain pre-training for gastrointestinal endoscopic image analysis offers benefits over pre-training on natural images.

To this end, we present a dataset comprising 5,014,174 gastrointestinal endoscopic images from eight different medical centers (GastroNet-5M), and exploit self-supervised learning with SimCLRv2, MoCov2 and DINO to learn relevant features for in-domain downstream tasks. The learned features are compared to features learned on natural images with multiple methods and variable amounts of data and/or labels (e.g. billion-scale semi-weakly supervised learning and supervised learning on ImageNet-21k). The evaluation is performed on five downstream data sets, each designed for a particular gastrointestinal task, for example GIANA for angiodysplasia detection and Kvasir-SEG for polyp segmentation.
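For context, all three frameworks named above learn without labels by making two augmented views of the same image agree in embedding space. The sketch below shows a SimCLR-style contrastive objective in PyTorch; the augmentation strengths, projection-head sizes and temperature are generic defaults, not the configuration used for GastroNet-5M.

```python
# Minimal sketch of SimCLR-style contrastive pre-training on image frames.
# Illustrative only: augmentations, head sizes and temperature are assumptions,
# not the authors' exact GastroNet-5M recipe.
import torch
import torch.nn.functional as F
import torchvision.transforms as T
from torchvision.models import resnet50

# Two random augmented "views" of the same image form a positive pair.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

encoder = resnet50(weights=None)
encoder.fc = torch.nn.Sequential(  # small projection head, as in SimCLRv2
    torch.nn.Linear(2048, 2048), torch.nn.ReLU(), torch.nn.Linear(2048, 128)
)

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent loss over a batch of positive pairs (z1[i], z2[i])."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)    # (2N, d) unit embeddings
    sim = z @ z.t() / temperature                  # pairwise similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float('-inf'))     # exclude self-similarity
    # The positive of z[i] sits at index i+n (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(sim.device)
    return F.cross_entropy(sim, targets)
```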

The findings indicate that self-supervised domain-specific pre-training, specifically using the DINO framework, results in better-performing models than any supervised pre-training on natural images. On the ResNet50 and Vision-Transformer-small architectures, self-supervised in-domain pre-training with DINO leads to an average performance boost of 1.63% and 4.62%, respectively, on the downstream datasets. This improvement is measured against the best performance achieved through pre-training on natural images within any of the evaluated frameworks.
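Since DINO delivers the largest reported gains, a minimal sketch of its self-distillation objective may help: a student network is trained to match the centered, sharpened output of an EMA teacher. The temperatures and momentum below follow the public DINO recipe and are assumptions with respect to this paper's setup.

```python
# Hedged sketch of the DINO self-distillation objective; hyperparameters are
# the public DINO defaults, not values reported for GastroNet-5M.
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered/sharpened teacher distribution
    and the student distribution over the same view."""
    t = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    s = F.log_softmax(student_out / student_temp, dim=-1)
    return -(t * s).sum(dim=-1).mean()

# The teacher is an exponential moving average (EMA) of the student weights.
@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)
```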

Moreover, the in-domain pre-trained models also exhibit increased robustness against distortion perturbations (noise, contrast, blur, etc.): the in-domain pre-trained ResNet50 and Vision-Transformer-small with DINO scored on average 1.28% and 3.55% higher on the performance metrics than the best pre-trained models on natural images.
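A robustness check of the kind described above can be approximated by corrupting the test images and re-scoring the model. The sketch below applies the three perturbation families named in the abstract; the severity scales and the accuracy metric are illustrative assumptions, not the paper's protocol.

```python
# Sketch of a distortion-robustness evaluation: perturb inputs, re-measure
# downstream accuracy. Severity scales are assumptions.
import torch
import torchvision.transforms.functional as TF

def perturb(images, kind, severity=1):
    """Apply one distortion family at a given severity to a batch in [0, 1]."""
    if kind == "noise":
        return (images + 0.04 * severity * torch.randn_like(images)).clamp(0, 1)
    if kind == "contrast":
        return TF.adjust_contrast(images, 1.0 - 0.15 * severity)
    if kind == "blur":
        return TF.gaussian_blur(images, kernel_size=2 * severity + 1)
    raise ValueError(kind)

@torch.no_grad()
def robustness_score(model, loader, kind, severity, device="cuda"):
    """Classification accuracy of `model` under one perturbation setting."""
    correct = total = 0
    for x, y in loader:
        x, y = perturb(x.to(device), kind, severity), y.to(device)
        correct += (model(x).argmax(1) == y).sum().item()
        total += y.numel()
    return correct / total
```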

Overall, this study highlights the importance of in-domain pre-training for improving the generality, scalability and performance of deep learning for medical image analysis. The GastroNet-5M pre-trained weights are made publicly available in our repository: huggingface.co/tgwboers/GastroNet-5M_Pretrained_Weights.
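One plausible way to reuse the published checkpoints is via the Hugging Face Hub. Note that the checkpoint file name below is a guess; consult the repository listing for the actual names.

```python
# Hedged sketch: load published GastroNet-5M weights into a torchvision
# ResNet50. The file name is hypothetical -- check the repo at
# huggingface.co/tgwboers/GastroNet-5M_Pretrained_Weights first.
import torch
from huggingface_hub import hf_hub_download
from torchvision.models import resnet50

ckpt_path = hf_hub_download(
    repo_id="tgwboers/GastroNet-5M_Pretrained_Weights",
    filename="resnet50_dino_gastronet5m.pth",  # assumed name
)
model = resnet50(weights=None)
state = torch.load(ckpt_path, map_location="cpu")
# strict=False tolerates projection-head keys left over from pre-training.
missing, unexpected = model.load_state_dict(state, strict=False)
print("missing:", missing, "unexpected:", unexpected)
```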

Source journal: Medical Image Analysis (Engineering: Biomedical)
CiteScore: 22.10
Self-citation rate: 6.40%
Articles per year: 309
Review time: 6.6 months
Journal description: Medical Image Analysis serves as a platform for sharing new research findings in the realm of medical and biological image analysis, with a focus on applications of computer vision, virtual reality, and robotics to biomedical imaging challenges. The journal prioritizes the publication of high-quality, original papers contributing to the fundamental science of processing, analyzing, and utilizing medical and biological images. It welcomes approaches utilizing biomedical image datasets across all spatial scales, from molecular/cellular imaging to tissue/organ imaging.