利用变换器统一突出物体检测中的全局-局部表征

IF 5.3 3区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Emerging Topics in Computational Intelligence Pub Date : 2024-04-02 DOI:10.1109/TETCI.2024.3380442

Sucheng Ren;Nanxuan Zhao;Qiang Wen;Guoqiang Han;Shengfeng He

{"title":"利用变换器统一突出物体检测中的全局-局部表征","authors":"Sucheng Ren;Nanxuan Zhao;Qiang Wen;Guoqiang Han;Shengfeng He","doi":"10.1109/TETCI.2024.3380442","DOIUrl":null,"url":null,"abstract":"The fully convolutional network (FCN) has dominated salient object detection for a long period. However, the locality of CNN requires the model deep enough to have a global receptive field and such a deep model always leads to the loss of local details. In this paper, we introduce a new attention-based encoder, vision transformer, into salient object detection to ensure the globalization of the representations from shallow to deep layers. With the global view in very shallow layers, the transformer encoder preserves more local representations to recover the spatial details in final saliency maps. Besides, as each layer can capture a global view of its previous layer, adjacent layers can implicitly maximize the representation differences and minimize the redundant features, making every output feature of transformer layers contribute uniquely to the final prediction. To decode features from the transformer, we propose a simple yet effective deeply-transformed decoder. The decoder densely decodes and upsamples the transformer features, generating the final saliency map with less noise injection. Experimental results demonstrate that our method significantly outperforms other FCN-based and transformer-based methods in five benchmarks by a large margin, with an average of 12.17% improvement in terms of Mean Absolute Error (MAE).","PeriodicalId":13135,"journal":{"name":"IEEE Transactions on Emerging Topics in Computational Intelligence","volume":"8 4","pages":"2870-2879"},"PeriodicalIF":5.3000,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Unifying Global-Local Representations in Salient Object Detection With Transformers\",\"authors\":\"Sucheng Ren;Nanxuan Zhao;Qiang Wen;Guoqiang Han;Shengfeng He\",\"doi\":\"10.1109/TETCI.2024.3380442\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The fully convolutional network (FCN) has dominated salient object detection for a long period. However, the locality of CNN requires the model deep enough to have a global receptive field and such a deep model always leads to the loss of local details. In this paper, we introduce a new attention-based encoder, vision transformer, into salient object detection to ensure the globalization of the representations from shallow to deep layers. With the global view in very shallow layers, the transformer encoder preserves more local representations to recover the spatial details in final saliency maps. Besides, as each layer can capture a global view of its previous layer, adjacent layers can implicitly maximize the representation differences and minimize the redundant features, making every output feature of transformer layers contribute uniquely to the final prediction. To decode features from the transformer, we propose a simple yet effective deeply-transformed decoder. The decoder densely decodes and upsamples the transformer features, generating the final saliency map with less noise injection. Experimental results demonstrate that our method significantly outperforms other FCN-based and transformer-based methods in five benchmarks by a large margin, with an average of 12.17% improvement in terms of Mean Absolute Error (MAE).\",\"PeriodicalId\":13135,\"journal\":{\"name\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"volume\":\"8 4\",\"pages\":\"2870-2879\"},\"PeriodicalIF\":5.3000,\"publicationDate\":\"2024-04-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10489918/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10489918/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

长期以来，全卷积网络（FCN）一直主导着突出物体检测。然而，全卷积网络的局部性要求模型足够深，以拥有全局感受野，而这样的深层模型总是会导致局部细节的丢失。在本文中，我们在突出物体检测中引入了一种新的基于注意力的编码器--视觉转换器，以确保表征从浅层到深层的全局化。在极浅层的全局视图中，变换器编码器保留了更多的局部表征，以恢复最终显著性图中的空间细节。此外，由于每一层都能捕捉到上一层的全局视图，相邻层可以隐含地最大化表征差异，最小化冗余特征，从而使变换器层的每个输出特性都能为最终预测做出独特贡献。为了对变换层的特征进行解码，我们提出了一种简单而有效的深度变换解码器。解码器对变换器特征进行密集解码和高采样，在生成最终突出图时减少噪声注入。实验结果表明，在五个基准测试中，我们的方法明显优于其他基于 FCN 和变压器的方法，平均绝对误差（MAE）提高了 12.17%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Unifying Global-Local Representations in Salient Object Detection With Transformers

The fully convolutional network (FCN) has dominated salient object detection for a long period. However, the locality of CNN requires the model deep enough to have a global receptive field and such a deep model always leads to the loss of local details. In this paper, we introduce a new attention-based encoder, vision transformer, into salient object detection to ensure the globalization of the representations from shallow to deep layers. With the global view in very shallow layers, the transformer encoder preserves more local representations to recover the spatial details in final saliency maps. Besides, as each layer can capture a global view of its previous layer, adjacent layers can implicitly maximize the representation differences and minimize the redundant features, making every output feature of transformer layers contribute uniquely to the final prediction. To decode features from the transformer, we propose a simple yet effective deeply-transformed decoder. The decoder densely decodes and upsamples the transformer features, generating the final saliency map with less noise injection. Experimental results demonstrate that our method significantly outperforms other FCN-based and transformer-based methods in five benchmarks by a large margin, with an average of 12.17% improvement in terms of Mean Absolute Error (MAE).

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Emerging Topics in Computational Intelligence Mathematics-Control and Optimization

CiteScore

10.30

自引率

7.50%

发文量

147

期刊介绍： The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys. TETCI is an electronics only publication. TETCI publishes six issues per year. Authors are encouraged to submit manuscripts in any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. A few such illustrative examples are glial cell networks, computational neuroscience, Brain Computer Interface, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, computational intelligence for the IoT and Smart-X technologies.