Authors: Shuo Hu, Xingwang Tao, Xingmiao Zhao
Journal: Journal of Visual Communication and Image Representation, Volume 110, Article 104466
Published: 2025-05-02 (Journal Article)
DOI: 10.1016/j.jvcir.2025.104466
URL: https://www.sciencedirect.com/science/article/pii/S104732032500080X
JCR: Q2, Computer Science, Information Systems; Impact Factor 2.6
MCANet: Feature pyramid network with multi-scale convolutional attention and aggregation mechanisms for semantic segmentation
The Feature Pyramid Network (FPN) is an important structure for achieving feature fusion in semantic segmentation networks. However, most current FPN-based methods capture cross-scale long-range information insufficiently and exhibit aliasing effects during cross-scale fusion. In this paper, we propose the Multi-Scale Convolutional Attention and Aggregation Mechanisms Feature Pyramid Network (MAFPN). We first construct a Context Information Enhancement Module, which provides multi-scale global feature information to the different levels through an adaptive aggregation Multi-Scale Convolutional Attention Module (AMSCAM). This alleviates the shortage of cross-scale semantic information caused by top-down feature fusion. Furthermore, we propose a feature aggregation mechanism that promotes semantic alignment through a Lightweight Convolutional Attention Module (LFAM), thus enhancing the overall effectiveness of information fusion. Finally, we employ a lightweight self-attention mechanism to capture global long-range dependencies. MCANet is a Transformer-based encoder–decoder architecture: the encoder adopts either Uniformer or Biformer in separate configurations, and the decoder consists of the MAFPN and FPN heads. With Biformer as the encoder, MCANet achieves 49.98% mIoU on the ADE20K dataset, and 80.95% and 80.45% mIoU on the Cityscapes validation and test sets, respectively. With Uniformer as the encoder, it attains 48.69% mIoU on ADE20K.
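The MAFPN decoder described above builds on the standard FPN top-down pathway, which the abstract says suffers from aliasing during cross-scale fusion. For orientation, here is a minimal NumPy sketch of that baseline pathway only (plain nearest-neighbor upsampling plus lateral addition); the paper's AMSCAM and LFAM modules are not specified in this abstract and are not modeled here:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(features):
    """Baseline FPN top-down fusion.

    features: list of (C, H, W) arrays ordered fine-to-coarse, each level
    half the spatial size of the previous one. Starting from the coarsest
    map, each step upsamples the running fused map and adds the lateral
    feature from the next finer level. Returns fused maps in input order.
    """
    fused = [features[-1]]
    for lateral in reversed(features[:-1]):
        fused.insert(0, lateral + upsample2x(fused[0]))
    return fused

# Three toy levels: 16x16, 8x8, 4x4, each with 8 channels.
feats = [np.ones((8, 2 ** (4 - i), 2 ** (4 - i))) for i in range(3)]
out = fpn_top_down(feats)
```

In this sketch the finest output accumulates contributions from every coarser level through repeated upsample-and-add steps, which is exactly where the aliasing the paper targets can arise.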
Journal introduction:
The Journal of Visual Communication and Image Representation publishes papers on state-of-the-art visual communication and image representation, with emphasis on novel technologies and theoretical work in this multidisciplinary area of pure and applied research. The field of visual communication and image representation is considered in its broadest sense and covers both digital and analog aspects as well as processing and communication in biological visual systems.