Amirreza Fateh, Mohammad Reza Mohammadi, Mohammad Reza Jahed Motlagh
{"title":"MSDNet:通过变压器引导的原型设计实现少镜头语义分割的多尺度解码器","authors":"Amirreza Fateh, Mohammad Reza Mohammadi, Mohammad Reza Jahed Motlagh","doi":"arxiv-2409.11316","DOIUrl":null,"url":null,"abstract":"Few-shot Semantic Segmentation addresses the challenge of segmenting objects\nin query images with only a handful of annotated examples. However, many\nprevious state-of-the-art methods either have to discard intricate local\nsemantic features or suffer from high computational complexity. To address\nthese challenges, we propose a new Few-shot Semantic Segmentation framework\nbased on the transformer architecture. Our approach introduces the spatial\ntransformer decoder and the contextual mask generation module to improve the\nrelational understanding between support and query images. Moreover, we\nintroduce a multi-scale decoder to refine the segmentation mask by\nincorporating features from different resolutions in a hierarchical manner.\nAdditionally, our approach integrates global features from intermediate encoder\nstages to improve contextual understanding, while maintaining a lightweight\nstructure to reduce complexity. 
This balance between performance and efficiency\nenables our method to achieve state-of-the-art results on benchmark datasets\nsuch as $PASCAL-5^i$ and $COCO-20^i$ in both 1-shot and 5-shot settings.\nNotably, our model with only 1.5 million parameters demonstrates competitive\nperformance while overcoming limitations of existing methodologies.\nhttps://github.com/amirrezafateh/MSDNet","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping\",\"authors\":\"Amirreza Fateh, Mohammad Reza Mohammadi, Mohammad Reza Jahed Motlagh\",\"doi\":\"arxiv-2409.11316\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Few-shot Semantic Segmentation addresses the challenge of segmenting objects\\nin query images with only a handful of annotated examples. However, many\\nprevious state-of-the-art methods either have to discard intricate local\\nsemantic features or suffer from high computational complexity. To address\\nthese challenges, we propose a new Few-shot Semantic Segmentation framework\\nbased on the transformer architecture. Our approach introduces the spatial\\ntransformer decoder and the contextual mask generation module to improve the\\nrelational understanding between support and query images. Moreover, we\\nintroduce a multi-scale decoder to refine the segmentation mask by\\nincorporating features from different resolutions in a hierarchical manner.\\nAdditionally, our approach integrates global features from intermediate encoder\\nstages to improve contextual understanding, while maintaining a lightweight\\nstructure to reduce complexity. 
This balance between performance and efficiency\\nenables our method to achieve state-of-the-art results on benchmark datasets\\nsuch as $PASCAL-5^i$ and $COCO-20^i$ in both 1-shot and 5-shot settings.\\nNotably, our model with only 1.5 million parameters demonstrates competitive\\nperformance while overcoming limitations of existing methodologies.\\nhttps://github.com/amirrezafateh/MSDNet\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11316\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11316","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
MSDNet: Multi-Scale Decoder for Few-Shot Semantic Segmentation via Transformer-Guided Prototyping
Few-shot Semantic Segmentation addresses the challenge of segmenting objects
in query images with only a handful of annotated examples. However, many
previous state-of-the-art methods either have to discard intricate local
semantic features or suffer from high computational complexity. To address
these challenges, we propose a new Few-shot Semantic Segmentation framework
based on the transformer architecture. Our approach introduces the spatial
transformer decoder and the contextual mask generation module to improve the
relational understanding between support and query images. Moreover, we
introduce a multi-scale decoder to refine the segmentation mask by
incorporating features from different resolutions in a hierarchical manner.
Additionally, our approach integrates global features from intermediate encoder
stages to improve contextual understanding, while maintaining a lightweight
structure to reduce complexity. This balance between performance and efficiency
enables our method to achieve state-of-the-art results on benchmark datasets
such as $PASCAL-5^i$ and $COCO-20^i$ in both 1-shot and 5-shot settings.
Notably, our model with only 1.5 million parameters demonstrates competitive
performance while overcoming limitations of existing methodologies.
https://github.com/amirrezafateh/MSDNet
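
The abstract does not give implementation details, but the prototyping it refers to typically builds on two standard few-shot-segmentation operations: masked average pooling of support features into a class prototype, and a dense cosine-similarity prior between that prototype and the query features, fused coarse-to-fine across scales. The sketch below is a minimal NumPy illustration of that common baseline only — the function names (`masked_average_pooling`, `cosine_prior_mask`, `hierarchical_fuse`) are our own, and the paper's transformer-guided components are not reproduced here.

```python
import numpy as np

def masked_average_pooling(feat, mask):
    """Class prototype: average support features inside the support mask.
    feat: (C, H, W) feature map; mask: (H, W) binary mask at the same resolution."""
    c, h, w = feat.shape
    m = mask.reshape(1, h, w)
    return (feat * m).sum(axis=(1, 2)) / (m.sum() + 1e-6)  # (C,)

def cosine_prior_mask(query_feat, prototype):
    """Cosine similarity between every query location and the prototype,
    rescaled from [-1, 1] to [0, 1] as a coarse prior mask."""
    c, h, w = query_feat.shape
    q = query_feat.reshape(c, -1)
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-6)
    p = prototype / (np.linalg.norm(prototype) + 1e-6)
    return ((p @ q + 1.0) / 2.0).reshape(h, w)

def hierarchical_fuse(priors):
    """Fuse coarse-to-fine prior masks: nearest-neighbour upsample the
    running estimate to each finer resolution and average the two.
    Assumes each resolution is an integer multiple of the previous one."""
    fused = priors[0]
    for p in priors[1:]:
        scale = p.shape[0] // fused.shape[0]
        fused = np.kron(fused, np.ones((scale, scale)))  # nearest upsample
        fused = 0.5 * (fused + p)
    return fused
```

A prior mask produced this way is only a coarse localization cue; methods in this family refine it with a learned decoder (here, the proposed multi-scale transformer decoder) rather than thresholding it directly.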