Dynamic Grained Encoder for Vision Transformers

Lin Song, Songyang Zhang, Songtao Liu, Zeming Li, Xuming He, Hongbin Sun, Jian Sun, Nanning Zheng
{"title":"动态粒度编码器的视觉变压器","authors":"Lin Song, Songyang Zhang, Songtao Liu, Zeming Li, Xuming He, Hongbin Sun, Jian Sun, Nanning Zheng","doi":"10.48550/arXiv.2301.03831","DOIUrl":null,"url":null,"abstract":"Transformers, the de-facto standard for language modeling, have been recently applied for vision tasks. This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images and save computational costs. Specifically, we propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region. Thus it achieves a fine-grained representation in discriminative regions while keeping high efficiency. Besides, the dynamic grained encoder is compatible with most vision transformer frameworks. Without bells and whistles, our encoder allows the state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification. Extensive experiments on object detection and segmentation further demonstrate the generalizability of our approach. Code is available at https://github.com/StevenGrove/vtpack.","PeriodicalId":72099,"journal":{"name":"Advances in neural information processing systems","volume":"12 1","pages":"5770-5783"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"17","resultStr":"{\"title\":\"Dynamic Grained Encoder for Vision Transformers\",\"authors\":\"Lin Song, Songyang Zhang, Songtao Liu, Zeming Li, Xuming He, Hongbin Sun, Jian Sun, Nanning Zheng\",\"doi\":\"10.48550/arXiv.2301.03831\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transformers, the de-facto standard for language modeling, have been recently applied for vision tasks. This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images and save computational costs. Specifically, we propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region. Thus it achieves a fine-grained representation in discriminative regions while keeping high efficiency. Besides, the dynamic grained encoder is compatible with most vision transformer frameworks. Without bells and whistles, our encoder allows the state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification. Extensive experiments on object detection and segmentation further demonstrate the generalizability of our approach. 
Code is available at https://github.com/StevenGrove/vtpack.\",\"PeriodicalId\":72099,\"journal\":{\"name\":\"Advances in neural information processing systems\",\"volume\":\"12 1\",\"pages\":\"5770-5783\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-01-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"17\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Advances in neural information processing systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2301.03831\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in neural information processing systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2301.03831","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 17

Abstract

Transformers, the de-facto standard for language modeling, have been recently applied for vision tasks. This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images and save computational costs. Specifically, we propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region. Thus it achieves a fine-grained representation in discriminative regions while keeping high efficiency. Besides, the dynamic grained encoder is compatible with most vision transformer frameworks. Without bells and whistles, our encoder allows the state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification. Extensive experiments on object detection and segmentation further demonstrate the generalizability of our approach. Code is available at https://github.com/StevenGrove/vtpack.
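The official implementation lives in the linked repository. As a rough illustration of the core idea only, the toy PyTorch sketch below assigns each spatial region either a coarse (single pooled) query or fine (per-token) queries based on a learned gate, then runs cross-attention over the full token set. The class name DynamicGrainedEncoderSketch, the region_size parameter, and the soft coarse/fine mixing are illustrative assumptions, not the paper's actual gating or query-assignment mechanism.

```python
import torch
import torch.nn as nn


class DynamicGrainedEncoderSketch(nn.Module):
    """Toy layer: each spatial region gets either a coarse (single pooled)
    query or fine (per-token) queries for cross-attention over all tokens."""

    def __init__(self, dim, region_size=4, num_heads=4):
        super().__init__()
        self.region_size = region_size
        self.gate = nn.Linear(dim, 1)  # scores how fine-grained a region should be
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C); H and W are assumed divisible by region_size.
        B, H, W, C = x.shape
        r = self.region_size
        tokens = x.reshape(B, H * W, C)  # keys/values: the full token set
        out = torch.zeros_like(x)

        for i in range(0, H, r):
            for j in range(0, W, r):
                region = x[:, i:i + r, j:j + r, :].reshape(B, r * r, C)
                score = torch.sigmoid(self.gate(region.mean(dim=1)))  # (B, 1)

                # Fine granularity: one query per token in the region.
                fine, _ = self.attn(region, tokens, tokens)
                # Coarse granularity: one pooled query, broadcast over the region.
                coarse, _ = self.attn(region.mean(dim=1, keepdim=True), tokens, tokens)
                coarse = coarse.expand(-1, r * r, -1)

                # Soft mix as a differentiable stand-in for a discrete choice;
                # a real dynamic encoder would skip the fine branch entirely
                # in low-score regions to actually save computation.
                mixed = score.unsqueeze(1) * fine + (1 - score.unsqueeze(1)) * coarse
                out[:, i:i + r, j:j + r, :] = mixed.reshape(B, r, r, C)
        return out


if __name__ == "__main__":
    layer = DynamicGrainedEncoderSketch(dim=64, region_size=4)
    feats = torch.randn(2, 16, 16, 64)  # (B, H, W, C)
    print(layer(feats).shape)           # torch.Size([2, 16, 16, 64])
```

In this sketch both branches are always computed and blended, so it does not save FLOPs; it only shows how a per-region gate could control query granularity. The actual savings reported in the paper (40%-60% lower complexity) come from genuinely emitting fewer queries in redundant regions.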