TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition

Lizhi Xiang, Miao Yin, Chengming Zhang, Aravind Sukumaran-Rajam, P. Sadayappan, Bo Yuan, Dingwen Tao
{"title":"TDC:通过硬件感知的Tucker分解在gpu上实现极其高效的cnn","authors":"Lizhi Xiang, Miao Yin, Chengming Zhang, Aravind Sukumaran-Rajam, P. Sadayappan, Bo Yuan, Dingwen Tao","doi":"10.1145/3572848.3577478","DOIUrl":null,"url":null,"abstract":"Tucker decomposition is one of the SOTA CNN model compression techniques. However, unlike the FLOPs reduction, we observe very limited inference time reduction with Tucker-compressed models using existing GPU software such as cuDNN. To this end, we propose an efficient end-to-end framework that can generate highly accurate and compact CNN models via Tucker decomposition and optimized inference code on GPUs. Specifically, we propose an ADMM-based training algorithm that can achieve highly accurate Tucker-format models. We also develop a high-performance kernel for Tucker-format convolutions and analytical performance models to guide the selection of execution parameters. We further propose a co-design framework to determine the proper Tucker ranks driven by practical inference time (rather than FLOPs). Our evaluation on five modern CNNs with A100 demonstrates that our compressed models with our optimized code achieve up to 2.21× speedup over cuDNN, 1.12× speedup over TVM, and 3.27× over the original models using cuDNN with at most 0.05% accuracy loss.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"119 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition\",\"authors\":\"Lizhi Xiang, Miao Yin, Chengming Zhang, Aravind Sukumaran-Rajam, P. Sadayappan, Bo Yuan, Dingwen Tao\",\"doi\":\"10.1145/3572848.3577478\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Tucker decomposition is one of the SOTA CNN model compression techniques. However, unlike the FLOPs reduction, we observe very limited inference time reduction with Tucker-compressed models using existing GPU software such as cuDNN. To this end, we propose an efficient end-to-end framework that can generate highly accurate and compact CNN models via Tucker decomposition and optimized inference code on GPUs. Specifically, we propose an ADMM-based training algorithm that can achieve highly accurate Tucker-format models. We also develop a high-performance kernel for Tucker-format convolutions and analytical performance models to guide the selection of execution parameters. We further propose a co-design framework to determine the proper Tucker ranks driven by practical inference time (rather than FLOPs). 
Our evaluation on five modern CNNs with A100 demonstrates that our compressed models with our optimized code achieve up to 2.21× speedup over cuDNN, 1.12× speedup over TVM, and 3.27× over the original models using cuDNN with at most 0.05% accuracy loss.\",\"PeriodicalId\":233744,\"journal\":{\"name\":\"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming\",\"volume\":\"119 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3572848.3577478\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3572848.3577478","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5

Abstract

Tucker decomposition is a state-of-the-art CNN model compression technique. However, in contrast to the large reduction in FLOPs, we observe very limited inference-time reduction for Tucker-compressed models with existing GPU software such as cuDNN. To this end, we propose an efficient end-to-end framework that generates highly accurate, compact CNN models via Tucker decomposition together with optimized inference code on GPUs. Specifically, we propose an ADMM-based training algorithm that produces highly accurate Tucker-format models. We also develop a high-performance kernel for Tucker-format convolutions and analytical performance models to guide the selection of execution parameters. We further propose a co-design framework that determines the proper Tucker ranks driven by practical inference time rather than FLOPs. Our evaluation of five modern CNNs on an NVIDIA A100 GPU demonstrates that our compressed models with our optimized code achieve up to a 2.21× speedup over cuDNN, a 1.12× speedup over TVM, and a 3.27× speedup over the original models running on cuDNN, with at most 0.05% accuracy loss.
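
To make the compression scheme concrete, below is a minimal PyTorch sketch of a Tucker-2-format convolution, the structure the abstract refers to. This is an illustration, not the authors' implementation or their optimized GPU kernel: a Cin→Cout K×K convolution is replaced by a 1×1 channel-compressing convolution, a small K×K "core" convolution, and a 1×1 channel-restoring convolution. The ranks R1 and R2 here are hypothetical illustration values; in TDC they would be chosen by the hardware-aware co-design loop.

```python
import torch
import torch.nn as nn

class TuckerConv2d(nn.Module):
    """Tucker-2-format convolution: 1x1 compress -> KxK core -> 1x1 restore."""

    def __init__(self, cin, cout, k, r1, r2, stride=1, padding=0):
        super().__init__()
        # First factor matrix: 1x1 conv compressing input channels Cin -> R1.
        self.compress = nn.Conv2d(cin, r1, kernel_size=1, bias=False)
        # Core tensor: KxK conv on the reduced channel space, R1 -> R2.
        self.core = nn.Conv2d(r1, r2, kernel_size=k, stride=stride,
                              padding=padding, bias=False)
        # Last factor matrix: 1x1 conv restoring output channels R2 -> Cout.
        self.restore = nn.Conv2d(r2, cout, kernel_size=1, bias=False)

    def forward(self, x):
        return self.restore(self.core(self.compress(x)))

# Per-output-pixel multiply-accumulates:
#   original conv: Cin * Cout * K * K
#   Tucker-2 form: Cin * R1 + R1 * R2 * K * K + R2 * Cout
# e.g., Cin = Cout = 256, K = 3, R1 = R2 = 64 (illustrative ranks):
#   589,824 vs. 16,384 + 36,864 + 16,384 = 69,632 (~8.5x fewer FLOPs)
```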
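The paper's key observation is that this FLOPs reduction does not automatically translate into wall-clock speedup with existing libraries, which is why TDC selects ranks by measured inference time. The following is a hedged sketch of latency-driven rank selection, reusing the TuckerConv2d sketch above; the timing helper, the candidate rank grid, and pick_ranks are illustrative assumptions, not the paper's actual co-design framework.

```python
import torch

def measure_latency_ms(module, x, warmup=10, iters=100):
    """Average CUDA wall-clock time per forward pass, in milliseconds."""
    with torch.no_grad():
        for _ in range(warmup):          # warm up kernels and caches
            module(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            module(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def pick_ranks(cin, cout, k, x, candidates=(16, 32, 64, 128)):
    """Pick (R1, R2) for one layer by measured latency, not FLOPs count."""
    best = None
    for r1 in candidates:
        for r2 in candidates:
            m = TuckerConv2d(cin, cout, k, r1, r2,
                             padding=k // 2).cuda().eval()
            ms = measure_latency_ms(m, x)
            if best is None or ms < best[0]:
                best = (ms, r1, r2)
    return best  # (latency_ms, R1, R2)

# Usage (illustrative):
#   x = torch.randn(1, 256, 56, 56, device="cuda")
#   ms, r1, r2 = pick_ranks(256, 256, 3, x)
```

In TDC the rank choice is additionally constrained by model accuracy, which the ADMM-based training maintains; this sketch captures only the latency-driven (rather than FLOPs-driven) selection criterion.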