SG-UNet: Hybrid self-guided transformer and U-Net fusion for CT image segmentation

IF 2.6, Zone 4 (Computer Science), Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS)

Journal of Visual Communication and Image Representation Pub Date : 2025-02-21 DOI:10.1016/j.jvcir.2025.104416

Chunjie Lv, Biyuan Li, Gaowei Sun, Xiuwei Wang, Pengfei Cai, Jun Yan

Volume 108, Article 104416. Full text: https://www.sciencedirect.com/science/article/pii/S1047320325000306

Citations: 0

Abstract

In recent years, transformer-based models have made substantial inroads in CT image segmentation. The Swin Transformer performs strongly, but it often struggles to capture fine-grained details, especially in complex tasks such as CT image segmentation, where distinguishing subtle differences in key areas is challenging. Additionally, because of its fixed window attention mechanism, the Swin Transformer tends to overemphasize local features while overlooking global context, leading to insufficient understanding of critical information and potential loss of important details. To address these limitations, we introduce a U-shaped Hybrid Self-Guided Transformer network (SG-UNet), specifically tailored for CT image segmentation. Our approach refines the self-attention mechanism by integrating hybrid attention with self-guided attention. The hybrid attention mechanism employs adaptive fine-grained global self-attention to capture low-level details and guide token assignment in salient regions, while the self-guided attention dynamically reallocates tokens, prioritizing target regions and reducing attention computation for non-target areas. Together, the two mechanisms enable the model to autonomously refine saliency maps and reassign tokens based on regional importance. For training, we combine CELoss and BDLoss, which improves training stability, mitigates gradient instability, and accelerates convergence. Additionally, a dynamic learning rate adjustment strategy adapts the learning rate during training, giving smoother convergence and better performance. Empirical validation on the Synapse and lung datasets demonstrates the superior segmentation performance of the Hybrid Self-Guided Transformer UNet, which achieves a Dice similarity coefficient (DSC) of 82.91% and a Hausdorff distance (HD) of 16.46 mm on the Synapse dataset, and 98.13% and 6.34 mm on the lung dataset, respectively. These results underscore the effectiveness of our model in segmentation tasks.
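The two metrics reported in the abstract, DSC and HD, can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the brute-force pixel-pair search, the empty-mask conventions, and the isotropic `spacing` parameter are all assumptions made for illustration, and benchmark toolkits often compute a surface-based or 95th-percentile variant of the Hausdorff distance instead.

```python
import math


def foreground(mask):
    """Return the set of (row, col) coordinates whose value is truthy."""
    return {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}


def dice_coefficient(pred, target):
    """Dice similarity coefficient of two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    p, t = foreground(pred), foreground(target)
    if not p and not t:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * len(p & t) / (len(p) + len(t))


def hausdorff_distance(pred, target, spacing=1.0):
    """Symmetric Hausdorff distance between the foreground pixels of two masks.

    Brute force over all pixel pairs, so only suitable for small examples.
    `spacing` is a hypothetical isotropic pixel size (for example, in mm).
    """
    p, t = foreground(pred), foreground(target)
    if not p or not t:
        return math.inf  # assumed convention: undefined when one mask is empty

    def directed(a, b):
        # Largest distance from a point in `a` to its nearest point in `b`.
        return max(min(math.dist(x, y) for y in b) for x in a)

    return spacing * max(directed(p, t), directed(t, p))


# Toy 4x4 masks: two 2x2 squares that overlap in one column.
pred = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
target = [[0, 0, 0, 0],
          [0, 0, 1, 1],
          [0, 0, 1, 1],
          [0, 0, 0, 0]]

print(dice_coefficient(pred, target))    # 0.5
print(hausdorff_distance(pred, target))  # 1.0
```

A higher DSC means better overlap, while a lower HD means the worst boundary mismatch is smaller, which is why the abstract reports the two together. For real CT volumes an array library would replace the pure-Python loops.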

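Source journal

Journal of Visual Communication and Image Representation (Engineering & Technology, Computer Science: Software Engineering)

CiteScore: 5.40

Self-citation rate: 11.50%

Articles published: 188

Review time: 9.9 months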

About the journal: The Journal of Visual Communication and Image Representation publishes papers on state-of-the-art visual communication and image representation, with emphasis on novel technologies and theoretical work in this multidisciplinary area of pure and applied research. The field of visual communication and image representation is considered in its broadest sense and covers both digital and analog aspects as well as processing and communication in biological visual systems.