I/O lower bounds for auto-tuning of convolutions in CNNs

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Pub Date : 2020-12-31 DOI:10.1145/3437801.3441609

Xiaoyang Zhang, Junmin Xiao, Guangming Tan

{"title":"I/O lower bounds for auto-tuning of convolutions in CNNs","authors":"Xiaoyang Zhang, Junmin Xiao, Guangming Tan","doi":"10.1145/3437801.3441609","DOIUrl":null,"url":null,"abstract":"Convolution is the most time-consuming part in the computation of convolutional neural networks (CNNs), which have achieved great successes in numerous practical applications. Due to the complex data dependency and the increase in the amount of model samples, the convolution suffers from high overhead on data movement (i.e., memory access). This work provides comprehensive analysis and methodologies to minimize the communication for the convolution in CNNs. With an in-depth analysis of the recent I/O complexity theory under the red-blue game model, we develop a general I/O lower bound theory for a composite algorithm which consists of several different sub-computations. Based on the proposed theory, we establish the data movement lower bound results for two main convolution algorithms in CNNs, namely the direct convolution and Winograd algorithm, which represents the direct and indirect implementations of a convolution respectively. Next, derived from I/O lower bound results, we design the near I/O-optimal dataflow strategies for the two main convolution algorithms by fully exploiting the data reuse. Furthermore, in order to push the envelope of performance of the near I/O-optimal dataflow strategies further, an aggressive design of auto-tuning based on I/O lower bounds, is proposed to search an optimal parameter configuration for the direct convolution and Winograd algorithm on GPU, such as the number of threads and the size of shared memory used in each thread block. Finally, experiment evaluation results on the direct convolution and Winograd algorithm show that our dataflow strategies with the auto-tuning approach can achieve about 3.32× performance speedup on average over cuDNN. In addition, compared with TVM, which represents the state-of-the-art technique for auto-tuning, not only our auto-tuning method based on I/O lower bounds can find the optimal parameter configuration faster, but also our solution has higher performance than the optimal solution provided by TVM.","PeriodicalId":124852,"journal":{"name":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3437801.3441609","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

Abstract

Convolution is the most time-consuming part in the computation of convolutional neural networks (CNNs), which have achieved great successes in numerous practical applications. Due to the complex data dependency and the increase in the amount of model samples, the convolution suffers from high overhead on data movement (i.e., memory access). This work provides comprehensive analysis and methodologies to minimize the communication for the convolution in CNNs. With an in-depth analysis of the recent I/O complexity theory under the red-blue game model, we develop a general I/O lower bound theory for a composite algorithm which consists of several different sub-computations. Based on the proposed theory, we establish the data movement lower bound results for two main convolution algorithms in CNNs, namely the direct convolution and Winograd algorithm, which represents the direct and indirect implementations of a convolution respectively. Next, derived from I/O lower bound results, we design the near I/O-optimal dataflow strategies for the two main convolution algorithms by fully exploiting the data reuse. Furthermore, in order to push the envelope of performance of the near I/O-optimal dataflow strategies further, an aggressive design of auto-tuning based on I/O lower bounds, is proposed to search an optimal parameter configuration for the direct convolution and Winograd algorithm on GPU, such as the number of threads and the size of shared memory used in each thread block. Finally, experiment evaluation results on the direct convolution and Winograd algorithm show that our dataflow strategies with the auto-tuning approach can achieve about 3.32× performance speedup on average over cuDNN. In addition, compared with TVM, which represents the state-of-the-art technique for auto-tuning, not only our auto-tuning method based on I/O lower bounds can find the optimal parameter configuration faster, but also our solution has higher performance than the optimal solution provided by TVM.

查看原文本刊更多论文

cnn中卷积自调优的I/O下界

卷积是卷积神经网络(convolutional neural network, cnn)计算中最耗时的部分，在众多的实际应用中取得了巨大的成功。由于复杂的数据依赖性和模型样本数量的增加，卷积在数据移动(即内存访问)上遭受了很高的开销。这项工作提供了全面的分析和方法，以尽量减少cnn中卷积的通信。在深入分析红蓝博弈模型下的I/O复杂度理论的基础上，提出了一种由若干不同子计算组成的复合算法的通用I/O下界理论。基于提出的理论，我们建立了cnn中两种主要卷积算法的数据移动下界结果，即直接卷积和Winograd算法，它们分别代表了卷积的直接实现和间接实现。其次，根据I/O下界结果，通过充分利用数据重用，为两种主要卷积算法设计了接近I/O最优的数据流策略。此外，为了进一步提高近I/O最优数据流策略的性能极限，提出了一种基于I/O下界的主动自动调优设计，在GPU上搜索直接卷积和Winograd算法的最优参数配置，如线程数和每个线程块使用的共享内存大小。最后，对直接卷积和Winograd算法的实验评估结果表明，采用自调优方法的数据流策略比cuDNN平均提高了3.32倍的性能。此外，与代表最先进自调优技术的TVM相比，我们基于I/O下界的自调优方法不仅可以更快地找到最优参数配置，而且我们的解比TVM提供的最优解具有更高的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

自引率

0.00%

发文量