Dynamic DETR: End-to-End Object Detection with Dynamic Attention

2021 IEEE/CVF International Conference on Computer Vision (ICCV) Pub Date: 2021-10-01 DOI: 10.1109/ICCV48922.2021.00298

Xiyang Dai, Yinpeng Chen, Jianwei Yang, Pengchuan Zhang, Lu Yuan, Lei Zhang

Pages: 2968-2977

Citations: 142

Abstract

In this paper, we present a novel Dynamic DETR (Detection with Transformers) approach that introduces dynamic attention into both the encoder and decoder stages of DETR to overcome its two limitations: small feature resolution and slow training convergence. The first limitation stems from the quadratic computational complexity of the self-attention module in Transformer encoders. To address it, we propose a convolution-based dynamic encoder that approximates the Transformer encoder's attention mechanism with several attention types. Such an encoder can dynamically adjust attention based on multiple factors such as scale importance, spatial importance, and representation (i.e., feature dimension) importance. To mitigate the second limitation, learning difficulty, we introduce a dynamic decoder that replaces the cross-attention module in the Transformer decoder with an RoI-based dynamic attention. Such a decoder effectively helps Transformers focus on regions of interest in a coarse-to-fine manner and dramatically lowers the learning difficulty, leading to much faster convergence with fewer training epochs. We conduct a series of experiments to demonstrate these advantages. Our Dynamic DETR significantly reduces the number of training epochs (by 14×), yet achieves much better performance (by 3.6 mAP). Meanwhile, in the standard 1× setup with a ResNet-50 backbone, we achieve new state-of-the-art performance, which further demonstrates the learning effectiveness of the proposed approach.
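The link between quadratic self-attention cost and small feature resolution can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the 800×1216 input size, the strides 32/16/8, and the helper name `attention_pairs` are all assumptions chosen for the example. It counts how many query-key pairs global self-attention would have to score on a single feature map.

```python
def attention_pairs(height, width, stride):
    """Return (tokens, query-key pairs) for global self-attention
    over one feature map of an image downsampled by `stride`."""
    tokens = (height // stride) * (width // stride)
    # Every token attends to every token, so the cost grows as tokens squared.
    return tokens, tokens * tokens


# Hypothetical 800x1216 input; strides 32, 16, and 8 are assumed for illustration.
for stride in (32, 16, 8):
    tokens, pairs = attention_pairs(800, 1216, stride)
    print(f"stride {stride:2d}: {tokens:6d} tokens, {pairs:12d} attention pairs")
```

Under these assumptions, each halving of the stride multiplies the token count by 4 and the number of attention pairs by 16, which is why a global-attention encoder tends to be restricted to coarse feature maps.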
