CSB-RNN: a faster-than-realtime RNN acceleration framework with compressed structured blocks

Proceedings of the 34th ACM International Conference on Supercomputing. Pub Date: 2020-05-11. DOI: 10.1145/3392717.3392749

Runbin Shi, Peiyan Dong, Tong Geng, Yuhao Ding, Xiaolong Ma, Hayden Kwok-Hay So, M. Herbordt, Ang Li, Yanzhi Wang


Citations: 10

Abstract

Recurrent neural networks (RNNs) have been widely adopted in temporal sequence analysis, where realtime performance is often required. However, RNNs carry a heavy computational workload because the models often come with large weight matrices. Pruning schemes (a form of model compression) have been proposed for RNNs to eliminate redundant (close-to-zero) weight values. On one hand, non-structured pruning methods achieve a high pruning rate but introduce computation irregularity (random sparsity), which is unfriendly to parallel hardware. On the other hand, hardware-oriented structured pruning suffers from a low pruning rate because of restrictive constraints on the allowable pruning structure. This paper presents CSB-RNN, an optimized full-stack RNN framework with a novel compressed structured block (CSB) pruning technique. The CSB-pruned RNN model has both a fine pruning granularity, which facilitates a high pruning rate, and a regular structure, which benefits hardware parallelism. To address the challenges of parallelizing CSB-pruned model inference with fine-grained structural sparsity, we propose a novel hardware architecture with a dedicated compiler. Thanks to the architecture-compilation co-design, the hardware not only supports various RNN cell types but also addresses the challenging workload imbalance issue, and therefore significantly improves hardware efficiency (utilization). Compared to the vanilla design without optimizations, hardware utilization is improved by over 2X. In experiments on 10 RNN models from multiple application domains, CSB pruning demonstrates a 3.5X-25X lossless pruning rate, which is 1.6X to 3.9X higher than existing designs. With several other innovations applied, CSB-RNN inference achieves a faster-than-realtime latency of 0.79μs-6.58μs in an FPGA implementation, which corresponds to 1.12X-12.57X lower latency and a 3.53X-58.89X improvement in power efficiency over the state of the art.
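To make the idea of block-structured sparsity concrete, the following is a minimal sketch and not the authors' CSB algorithm, whose details the abstract does not give. It assumes, purely for illustration, that the weight matrix is split into fixed-size blocks and that each block keeps only the intersection of its strongest rows and columns (ranked by L2 norm). The function names `prune_blocks` and `pruning_rate`, and the parameters `block_rows`, `block_cols`, `keep_rows`, and `keep_cols`, are hypothetical.

```python
import math


def prune_blocks(weights, block_rows, block_cols, keep_rows, keep_cols):
    """Zero all but the strongest rows and columns inside each block.

    Illustrative assumption only: rows and columns are ranked per block
    by L2 norm, and only their intersection survives.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    assert n_rows % block_rows == 0 and n_cols % block_cols == 0
    pruned = [[0.0] * n_cols for _ in range(n_rows)]
    for r0 in range(0, n_rows, block_rows):
        for c0 in range(0, n_cols, block_cols):
            rows = range(r0, r0 + block_rows)
            cols = range(c0, c0 + block_cols)
            # L2 norm of each row and each column, restricted to this block.
            row_norm = {r: math.sqrt(sum(weights[r][c] ** 2 for c in cols)) for r in rows}
            col_norm = {c: math.sqrt(sum(weights[r][c] ** 2 for r in rows)) for c in cols}
            top_rows = sorted(rows, key=lambda r: row_norm[r], reverse=True)[:keep_rows]
            top_cols = sorted(cols, key=lambda c: col_norm[c], reverse=True)[:keep_cols]
            # Every block keeps the same number of weights, so the work
            # per block is regular even though the kept positions differ.
            for r in top_rows:
                for c in top_cols:
                    pruned[r][c] = weights[r][c]
    return pruned


def pruning_rate(block_rows, block_cols, keep_rows, keep_cols):
    """Ratio of original weights to kept weights (e.g. 4.0 means 4X)."""
    return (block_rows * block_cols) / (keep_rows * keep_cols)


W = [
    [1.0, 0.1, 0.2, 3.0],
    [0.1, 0.1, 0.1, 0.2],
    [0.1, 0.2, 5.0, 0.1],
    [0.3, 4.0, 0.1, 0.1],
]
pruned = prune_blocks(W, block_rows=2, block_cols=2, keep_rows=1, keep_cols=1)
rate = pruning_rate(2, 2, 1, 1)
```

In this toy setup each 2x2 block keeps exactly one weight, so the pruning rate is 4X. The kept position varies from block to block (fine granularity), while the number of surviving weights per block is constant (regular structure), which is the kind of trade-off the abstract describes.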
