Authors: Shuo Hu, Xingwang Tao, Xingmiao Zhao
Journal: Journal of Visual Communication and Image Representation, Volume 110, Article 104466
Published: 2025-05-02 (Journal Article)
DOI: 10.1016/j.jvcir.2025.104466
URL: https://www.sciencedirect.com/science/article/pii/S104732032500080X
JCR: Q2, Computer Science, Information Systems; Impact Factor 2.6
MCANet: Feature pyramid network with multi-scale convolutional attention and aggregation mechanisms for semantic segmentation
The Feature Pyramid Network (FPN) is an important structure for achieving feature fusion in semantic segmentation networks. However, most current FPN-based methods capture cross-scale long-range information insufficiently and exhibit aliasing effects during cross-scale fusion. In this paper, we propose the Multi-Scale Convolutional Attention and Aggregation Mechanisms Feature Pyramid Network (MAFPN). We first construct a Context Information Enhancement Module, which provides multi-scale global feature information to the different levels through an adaptive aggregation Multi-Scale Convolutional Attention Module (AMSCAM). This alleviates the shortage of cross-scale semantic information caused by top-down feature fusion. Furthermore, we propose a feature aggregation mechanism that promotes semantic alignment through a Lightweight Convolutional Attention Module (LFAM), thus enhancing the overall effectiveness of information fusion. Finally, we employ a lightweight self-attention mechanism to capture global long-range dependencies. MCANet is a Transformer-based encoder–decoder architecture: the encoder adopts either Uniformer or Biformer in separate configurations, and the decoder consists of the MAFPN and FPN heads. With Biformer as the encoder, MCANet achieves 49.98% mIoU on the ADE20K dataset, and 80.95% and 80.45% mIoU on the Cityscapes validation and test sets, respectively. With Uniformer as the encoder, it attains 48.69% mIoU on ADE20K.
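The MAFPN decoder described above builds on the standard FPN top-down pathway, which the abstract says suffers from aliasing during cross-scale fusion. For orientation, here is a minimal NumPy sketch of that baseline pathway only (plain nearest-neighbor upsampling plus lateral addition); the paper's AMSCAM and LFAM modules are not specified in this abstract and are not modeled here:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(features):
    """Baseline FPN top-down fusion.

    features: list of (C, H, W) arrays ordered fine-to-coarse, each level
    half the spatial size of the previous one. Starting from the coarsest
    map, each step upsamples the running fused map and adds the lateral
    feature from the next finer level. Returns fused maps in input order.
    """
    fused = [features[-1]]
    for lateral in reversed(features[:-1]):
        fused.insert(0, lateral + upsample2x(fused[0]))
    return fused

# Three toy levels: 16x16, 8x8, 4x4, each with 8 channels.
feats = [np.ones((8, 2 ** (4 - i), 2 ** (4 - i))) for i in range(3)]
out = fpn_top_down(feats)
```

In this sketch the finest output accumulates contributions from every coarser level through repeated upsample-and-add steps, which is exactly where the aliasing the paper targets can arise.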
Journal introduction:
The Journal of Visual Communication and Image Representation publishes papers on state-of-the-art visual communication and image representation, with emphasis on novel technologies and theoretical work in this multidisciplinary area of pure and applied research. The field of visual communication and image representation is considered in its broadest sense and covers both digital and analog aspects as well as processing and communication in biological visual systems.