UCC: A unified cascade compression framework for vision transformer models

IF 5.5 · Zone 2 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)

Neurocomputing · Pub Date: 2024-10-18 · DOI: 10.1016/j.neucom.2024.128747

Dingfu Chen, Kangwei Lin, Qingxu Deng

Volume 612, Article 128747. Source: https://www.sciencedirect.com/science/article/pii/S0925231224015182

Citations: 0

Abstract


In recent years, the Vision Transformer (ViT) and its variants have dominated many computer vision tasks. However, ViT's high computational cost and training data requirements make it difficult to deploy directly on resource-constrained devices and environments. Model compression is an effective way to accelerate deep learning networks, but existing methods for compressing ViT models are limited in scope and struggle to balance performance against computational cost. In this paper, we propose a novel Unified Cascaded Compression Framework (UCC) to compress ViT more precisely and efficiently. Specifically, we first analyze the frequency information within tokens and prune them based on a joint score of both their spatial and spectral characteristics. Subsequently, we propose a similarity-based token aggregation scheme that combines the abundant contextual information contained in all pruned tokens with the host tokens according to their weights. Additionally, we introduce a novel cumulative cascaded pruning strategy that prunes tokens bottom-up based on cumulative scores, avoiding the information loss caused by the idiosyncrasies of individual blocks. Finally, we design a novel two-level distillation strategy, incorporating imitation and exploration, to ensure diversity of knowledge and better performance recovery. Extensive experiments demonstrate that UCC outperforms most existing state-of-the-art approaches: it reduces the floating-point operations (FLOPs) of the ViT-Base and DeiT-Base models by 22% and 54.1% and improves their recognition accuracy by 3.74% and 1.89%, respectively. The framework thus significantly reduces computational cost while enhancing performance, enabling efficient end-to-end training of compact ViT models.
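
The pruning and aggregation steps named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no formulas, so the sketch assumes that per-token spatial and spectral scores are already available, that the joint score is a weighted sum with a hypothetical mixing weight `alpha`, and that each pruned token is merged into its single most similar kept ("host") token, weighted by cosine similarity. The function name `prune_and_aggregate` and all parameters are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_and_aggregate(tokens, spatial, spectral, keep, alpha=0.5):
    """Keep the `keep` highest-scoring tokens and fold each pruned token
    into its most similar kept ("host") token.

    tokens:   list of embedding vectors (lists of floats)
    spatial:  per-token spatial scores (assumed to be given)
    spectral: per-token spectral scores (assumed to be given)
    alpha:    hypothetical mixing weight for the joint score
    """
    # Assumed joint score: a convex combination of the two scores.
    joint = [alpha * s + (1 - alpha) * f for s, f in zip(spatial, spectral)]
    order = sorted(range(len(tokens)), key=lambda i: joint[i], reverse=True)
    kept = sorted(order[:keep])  # preserve the original token order
    pruned = order[keep:]

    # Each host starts as its own embedding with weight 1.
    sums = {h: list(tokens[h]) for h in kept}
    weights = {h: 1.0 for h in kept}
    for p in pruned:
        sims = {h: cosine(tokens[p], tokens[h]) for h in kept}
        host = max(kept, key=lambda h: sims[h])
        w = max(sims[host], 0.0)  # a negative similarity contributes nothing
        sums[host] = [s + w * x for s, x in zip(sums[host], tokens[p])]
        weights[host] += w

    # Weighted average of each host and the pruned tokens assigned to it.
    merged = [[s / weights[h] for s in sums[h]] for h in kept]
    return kept, merged


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 2.0]]
spatial = [0.9, 0.8, 0.1, 0.2]
spectral = [0.7, 0.6, 0.3, 0.0]
kept, merged = prune_and_aggregate(tokens, spatial, spectral, keep=2)
print(kept)    # [0, 1]
print(merged)  # [[1.0, 0.0], [0.0, 1.5]]
```

In this toy input, tokens 2 and 3 are pruned and each is averaged into the kept token it points along, so their information is retained rather than discarded. The cumulative cascaded strategy and the two-level distillation described in the abstract are not modeled here.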


Source journal

Neurocomputing (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore: 13.10

Self-citation rate: 10.00%

Articles published: 1382

Review time: 70 days

About the journal: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.