FeedFormer: Revisiting Transformer Decoder for Efficient Semantic Segmentation

Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence
Pub Date: 2023-06-26
DOI: 10.1609/aaai.v37i2.25321

J. Shim, Hyunwoo Yu, Kyeongbo Kong, Suk-Ju Kang

Pages: 2263-2271

Citations: 0

Abstract



With the success of the Vision Transformer (ViT) in image classification, its variants have achieved strong results in many downstream vision tasks, and semantic segmentation has benefited greatly from these advances. However, most transformer-based semantic segmentation studies focus only on designing efficient transformer encoders and rarely give attention to the decoder. Several studies have attempted to use the transformer decoder as the segmentation decoder with class-wise learnable queries. Instead, we aim to use the encoder features directly as the queries. This paper proposes the Feature Enhancing Decoder transFormer (FeedFormer), which enhances structural information using the transformer decoder. Our goal is to decode the high-level encoder features using the lowest-level encoder feature. We do this by formulating the high-level features as queries and the lowest-level feature as the key and value, so that the high-level features are enhanced by collecting structural information from the lowest-level feature. Additionally, to improve efficiency, we use a simple reformation trick of pushing the encoder blocks to take the place of the existing self-attention module of the decoder. We demonstrate the superiority of our decoder in comparison with various lightweight transformer-based decoders on popular semantic segmentation datasets. Despite its small computational cost, our model achieves state-of-the-art performance in the performance-computation trade-off. Our model FeedFormer-B0 surpasses SegFormer-B0 with 1.8% higher mIoU and 7.1% less computation on ADE20K, and with 1.7% higher mIoU and 14.4% less computation on Cityscapes. Code will be released at: https://github.com/jhshim1995/FeedFormer.
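The query/key/value assignment described in the abstract can be illustrated with a toy cross-attention step. This is only a minimal sketch, not the authors' implementation: the function names (`cross_attention`, `enhance`), the single attention head, the absence of learned projections, the residual addition, and the assumption that both feature levels already share the same channel width are all choices made here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned weights).

    queries: N vectors, standing in for high-level feature tokens.
    keys, values: M vectors, standing in for lowest-level feature tokens.
    Returns N vectors, each a weighted average of the value vectors.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    width = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the values, one channel at a time.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return outputs


def enhance(high_level, lowest_level):
    """Hypothetical residual update: query + attended low-level information."""
    attended = cross_attention(high_level, lowest_level, lowest_level)
    return [[h + a for h, a in zip(hq, at)] for hq, at in zip(high_level, attended)]


# Toy data: 2 coarse (high-level) tokens query 4 fine (lowest-level) tokens.
high_level = [[1.0, 0.0], [0.0, 1.0]]
lowest_level = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
enhanced = enhance(high_level, lowest_level)
```

Each output token keeps the coarse resolution of the queries while mixing in the fine-resolution tokens it most resembles, which is the sense in which the high-level features "collect" information from the lowest-level feature. A practical decoder would add learned projections, multiple heads, and normalization, none of which are shown here.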
