MSCA: A few-shot segmentation framework driven by multi-scale cross-attention and information extraction

IF 3.5, Zone 3 (Computer Science), JCR Q2 (Computer Science, Artificial Intelligence)

Computer Vision and Image Understanding, Pub Date: 2025-06-10, DOI: 10.1016/j.cviu.2025.104419

Zhihao Ren, Shengning Lu, Xinhua Wang, Yaoming Liu, Yong Liang

Volume 259, Article 104419

Full text: https://www.sciencedirect.com/science/article/pii/S1077314225001420

Citations: 0


MSCA: A few-shot segmentation framework driven by multi-scale cross-attention and information extraction

Few-Shot Semantic Segmentation (FSS) aims to achieve precise pixel-level segmentation of target objects in query images using only a small number of annotated support images. The main challenge lies in effectively capturing and transferring critical information from support samples while establishing fine-grained semantic associations between query and support images to improve segmentation accuracy. However, existing methods struggle with spatial alignment issues caused by intra-class variations and inter-class visual similarities, and they fail to fully integrate high-level and low-level decoder features. To address these limitations, we propose a novel framework based on cross-scale interactive attention mechanisms. This framework employs a hybrid mask-guided multi-scale feature fusion strategy, constructing a cross-scale attention network that spans from local details to global context. It dynamically enhances target region representation and alleviates spatial misalignment issues. Furthermore, we design a hierarchical multi-axis decoding architecture that progressively integrates multi-resolution feature pathways, enabling the model to focus on semantic associations within foreground regions. Experimental results show that our Multi-Scale Cross-Attention (MSCA) model performs exceptionally well on the PASCAL-5i and COCO-20i benchmark datasets, achieving highly competitive results. Notably, the model contains only 1.86 million learnable parameters, demonstrating its efficiency and practical applicability.
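The abstract describes mask-guided cross-attention between query and support features at several scales. The sketch below is only an illustration of that general idea and is not the authors' implementation: the function names (`masked_cross_attention`, `pool`, `multi_scale_cross_attention`), the pooling factors, and the averaging of scales are all assumptions introduced here. It uses plain Python lists, assumes feature maps are H x W grids of d-dimensional vectors, and assumes H and W are divisible by every pooling factor.

```python
import math


def masked_cross_attention(query, support, mask):
    """Attend from each query vector to foreground support vectors only.

    query:   list of Nq vectors (lists of d floats)
    support: list of Ns vectors (same d)
    mask:    list of Ns floats; positions with mask <= 0 are excluded
    """
    d = len(query[0])
    fg = [j for j, m in enumerate(mask) if m > 0]
    if not fg:
        raise ValueError("support mask has no foreground positions")
    out = []
    for q in query:
        # Scaled dot-product scores against foreground support positions.
        scores = [sum(a * b for a, b in zip(q, support[j])) / math.sqrt(d)
                  for j in fg]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * support[j][c] for w, j in zip(weights, fg))
                    for c in range(d)])
    return out


def pool(fmap, f, mask=None):
    """Pool an H x W x d map over f x f windows and flatten it row by row.

    With a mask, only foreground cells contribute (masked average pooling).
    Returns (vectors, coverage), where coverage is the foreground fraction
    of each window (1.0 everywhere when no mask is given).
    """
    h, w, d = len(fmap), len(fmap[0]), len(fmap[0][0])
    vectors, coverage = [], []
    for y in range(0, h, f):
        for x in range(0, w, f):
            cells = [(yy, xx) for yy in range(y, y + f)
                     for xx in range(x, x + f)]
            wts = [1.0 if mask is None else float(mask[yy][xx])
                   for yy, xx in cells]
            total = sum(wts)
            if total > 0:
                vec = [sum(wt * fmap[yy][xx][c]
                           for wt, (yy, xx) in zip(wts, cells)) / total
                       for c in range(d)]
            else:
                vec = [0.0] * d  # pure-background window, ignored later
            vectors.append(vec)
            coverage.append(total / len(cells))
    return vectors, coverage


def multi_scale_cross_attention(query_map, support_map, support_mask,
                                factors=(1, 2)):
    """Run masked cross-attention at each scale and average the results."""
    h, w, d = len(query_map), len(query_map[0]), len(query_map[0][0])
    fused = [[[0.0] * d for _ in range(w)] for _ in range(h)]
    for f in factors:
        q, _ = pool(query_map, f)
        s, m = pool(support_map, f, support_mask)
        attended = masked_cross_attention(q, s, m)
        cols = w // f
        for y in range(h):
            for x in range(w):
                # Nearest-neighbour upsampling back to full resolution.
                vec = attended[(y // f) * cols + (x // f)]
                for c in range(d):
                    fused[y][x][c] += vec[c] / len(factors)
    return fused


# Toy example: one foreground support cell carrying the vector [1.0, 0.0].
query_map = [[[0.3, 0.1], [0.0, 0.2]], [[0.5, 0.5], [0.9, 0.0]]]
support_map = [[[1.0, 0.0], [0.0, 5.0]], [[0.0, 5.0], [0.0, 5.0]]]
support_mask = [[1, 0], [0, 0]]
fused = multi_scale_cross_attention(query_map, support_map, support_mask)
```

In this toy case the background vectors `[0.0, 5.0]` never reach the output, because the mask removes them from the attention at both scales, so every query cell receives the foreground vector `[1.0, 0.0]`. A real model would apply learned projections to queries, keys, and values and would fuse scales with learned layers rather than a plain average.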


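Source journal

Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days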

Journal description: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis, from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.

Research areas include:

• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems