{"title":"Transformer vs. CNN – A Comparison on Knee Segmentation in Ultrasound Images","authors":"Peter Brößner, B. Hohlmann, K. Radermacher","doi":"10.29007/cqcv","DOIUrl":null,"url":null,"abstract":"The automated and robust segmentation of bone surfaces in ultrasound (US) images can open up new fields of application for US imaging in computer-assisted orthopedic surgery, e.g. for the patient-specific planning process in computer-assisted knee replacement. For the automated, deep learning-based segmentation of medical images, CNN-based methods have been the state of the art over the last years, while recently Transformer-based methods are on the rise in computer vision. To compare these methods with respect to US image segmentation, in this paper the recent Transformer- based Swin-UNet is exemplarily benchmarked against the commonly used CNN-based nnUNet on the application of in-vivo 2D US knee segmentation.Trained and tested on our own dataset with 8166 annotated images (split in 7155 and 1011 images respectively), both the nnUNet and the pre-trained Swin-UNet show a Dice coefficient of 0.78 during testing. For distances between skeletonized labels and predictions, a symmetric Hausdorff distance of 44.69 pixels and a symmetric surface distance of 5.77 pixels is found for nnUNet as compared to 42.78 pixels and 5.68 pixels respectively for the Swin-UNet. Based on qualitative assessment, the Transformer-based Swin-UNet appears to benefit from its capability of learning global relationships as compared to the CNN-based nnUNet, while the latter shows more consistent and smooth predictions on a local level, presumably due to the character of convolution operation. Besides, the Swin-UNet requires generalized pre-training to be competitive.Since both architectures are evenly suited for the task at hand, for our future work, hybrid architectures combining the characteristic advantages of Transformer-based and CNN-based methods seem promising for US image segmentation.","PeriodicalId":385854,"journal":{"name":"EPiC Series in Health Sciences","volume":"65 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"EPiC Series in Health Sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.29007/cqcv","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
The automated and robust segmentation of bone surfaces in ultrasound (US) images can open up new fields of application for US imaging in computer-assisted orthopedic surgery, e.g. for the patient-specific planning process in computer-assisted knee replacement. For the automated, deep-learning-based segmentation of medical images, CNN-based methods have been the state of the art over the last years, while Transformer-based methods have recently been on the rise in computer vision. To compare these methods with respect to US image segmentation, in this paper the recent Transformer-based Swin-UNet is exemplarily benchmarked against the commonly used CNN-based nnUNet on the application of in-vivo 2D US knee segmentation. Trained and tested on our own dataset of 8166 annotated images (split into 7155 training and 1011 test images), both the nnUNet and the pre-trained Swin-UNet achieve a Dice coefficient of 0.78 during testing. For distances between skeletonized labels and predictions, a symmetric Hausdorff distance of 44.69 pixels and a symmetric surface distance of 5.77 pixels are found for the nnUNet, compared to 42.78 pixels and 5.68 pixels, respectively, for the Swin-UNet. Based on qualitative assessment, the Transformer-based Swin-UNet appears to benefit from its capability of learning global relationships, whereas the CNN-based nnUNet shows more consistent and smooth predictions on a local level, presumably due to the nature of the convolution operation. In addition, the Swin-UNet requires generalized pre-training to be competitive. Since both architectures are equally suited for the task at hand, hybrid architectures combining the characteristic advantages of Transformer-based and CNN-based methods seem promising for US image segmentation in our future work.
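For readers unfamiliar with the evaluation metrics named above, the following is a minimal sketch of how a Dice coefficient, a symmetric Hausdorff distance and a symmetric surface distance between skeletonized label and prediction masks can be computed. It is an illustrative implementation using scipy and scikit-image, not the authors' evaluation code; all function and variable names are assumptions for this example.

```python
# Illustrative metric sketch (not the paper's code): Dice on binary masks,
# symmetric Hausdorff and average symmetric surface distance on skeletonized masks.
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from scipy.ndimage import distance_transform_edt
from skimage.morphology import skeletonize


def dice_coefficient(pred: np.ndarray, label: np.ndarray) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) on binary segmentation masks."""
    pred, label = pred.astype(bool), label.astype(bool)
    denom = pred.sum() + label.sum()
    return 2.0 * np.logical_and(pred, label).sum() / denom if denom else 1.0


def symmetric_hausdorff(pred: np.ndarray, label: np.ndarray) -> float:
    """Symmetric Hausdorff distance (pixels) between skeletonized masks."""
    p_pts = np.argwhere(skeletonize(pred.astype(bool)))
    l_pts = np.argwhere(skeletonize(label.astype(bool)))
    return max(directed_hausdorff(p_pts, l_pts)[0],
               directed_hausdorff(l_pts, p_pts)[0])


def symmetric_surface_distance(pred: np.ndarray, label: np.ndarray) -> float:
    """Average symmetric surface distance (pixels) between skeletonized masks."""
    p = skeletonize(pred.astype(bool))
    l = skeletonize(label.astype(bool))
    dist_to_l = distance_transform_edt(~l)  # distance of each pixel to the label skeleton
    dist_to_p = distance_transform_edt(~p)  # distance of each pixel to the prediction skeleton
    return (dist_to_l[p].sum() + dist_to_p[l].sum()) / (p.sum() + l.sum())


if __name__ == "__main__":
    # Toy 2D masks standing in for a bone-surface label and a slightly shifted prediction.
    label = np.zeros((64, 64), dtype=bool)
    pred = np.zeros((64, 64), dtype=bool)
    label[30:33, 10:50] = True
    pred[31:34, 12:48] = True
    print("Dice:", dice_coefficient(pred, label))
    print("Symmetric Hausdorff [px]:", symmetric_hausdorff(pred, label))
    print("Symmetric surface distance [px]:", symmetric_surface_distance(pred, label))
```

In the paper's results, the Dice coefficient is reported on the full segmentation masks, while the Hausdorff and surface distances are reported between skeletonized labels and predictions, which is why the distance functions above skeletonize their inputs first.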