SumMerge: an efficient algorithm and implementation for weight repetition-aware DNN inference

Rohan Baskar Prabhakar, Sachit Kuhar, R. Agrawal, C. Hughes, Christopher W. Fletcher

Proceedings of the ACM International Conference on Supercomputing (ICS), June 2021. DOI: 10.1145/3447818.3460375
Deep Neural Network (DNN) inference efficiency is a key concern across the myriad of domains now relying on Deep Learning. A recent promising direction to speed up inference is to exploit \emph{weight repetition}. The key observation is that due to DNN quantization schemes---which attempt to reduce DNN storage requirements by reducing the number of bits needed to represent each weight---the same weight is bound to repeat many times within and across filters. This enables a weight-repetition-aware inference kernel to factor out and memoize common sub-computations, reducing arithmetic per inference while still maintaining the compression benefits of quantization. Yet, significant challenges remain. For instance, weight repetition introduces significant irregularity into the inference operation and hence (up to this point) has required custom hardware accelerators to derive a net benefit. This paper proposes SumMerge: a new algorithm and set of implementation techniques that make weight repetition practical on general-purpose devices such as CPUs. The key idea is to formulate inference as traversing a sequence of data-flow graphs \emph{with weight-dependent structure}. We develop an offline heuristic to select a data-flow graph structure that minimizes arithmetic operations per inference (given trained weight values) and use an efficient online procedure to traverse each data-flow graph and compute the inference result given DNN inputs. We implement the above as an optimized C++ routine that runs on a commercial multicore processor with vector extensions and evaluate performance relative to Intel's optimized oneDNN library and the prior-art weight-repetition algorithm (AGR). When applied on top of six different quantization schemes, SumMerge achieves speedups of 1.09x-2.05x relative to oneDNN and 1.04x-1.51x relative to AGR, while simultaneously compressing the DNN model by 8.7x-15.4x.
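To make the weight-repetition observation concrete, here is a minimal C++ sketch of the underlying factorization idea: because a quantized filter contains only a few distinct weight values, inputs that share a weight can be summed first and multiplied once, trading n multiplies for roughly k, where k is the number of distinct weights. This is only an illustration of the principle, not the SumMerge algorithm itself, which additionally builds weight-dependent data-flow graphs offline and traverses them with a vectorized online routine; the function names and toy values below are hypothetical.

```cpp
// Illustrative sketch only: a dot product with and without exploiting
// weight repetition, assuming a quantized filter with few distinct values.
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Naive dot product: one multiply-accumulate per weight.
float dot_naive(const std::vector<float>& w, const std::vector<float>& x) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < w.size(); ++i)
        acc += w[i] * x[i];
    return acc;
}

// Repetition-aware dot product: sum the inputs that share each distinct
// weight value, then multiply once per distinct value. With n inputs and
// k distinct quantized weights, this costs n adds but only k multiplies.
float dot_grouped(const std::vector<float>& w, const std::vector<float>& x) {
    std::unordered_map<float, float> partial;  // distinct weight -> input sum
    for (std::size_t i = 0; i < w.size(); ++i)
        partial[w[i]] += x[i];
    float acc = 0.0f;
    for (const auto& [weight, sum] : partial)
        acc += weight * sum;
    return acc;
}

int main() {
    // Toy filter quantized to two distinct weight values (0.5 and -0.25).
    std::vector<float> w = {0.5f, -0.25f, 0.5f, 0.5f, -0.25f};
    std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    float a = dot_naive(w, x), b = dot_grouped(w, x);
    std::printf("naive=%f grouped=%f\n", a, b);
    return std::fabs(a - b) < 1e-5f ? 0 : 1;
}
```

Under this view, the paper's offline step can be thought of as deciding which partial sums are worth forming and sharing within and across filters, while the online step traverses the resulting data-flow graph; the sketch above hard-codes the simplest possible grouping for a single dot product.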