Mixed Precision Quantization for ReRAM-based DNN Inference Accelerators

2021 26th Asia and South Pacific Design Automation Conference (ASP-DAC). Pub Date: 2021-01-18. DOI: 10.1145/3394885.3431554

Sitao Huang, Aayush Ankit, P. Silveira, Rodrigo Antunes, S. R. Chalamalasetti, I. E. Hajj, Dong Eun Kim, G. Aguiar, P. Bruel, S. Serebryakov, Cong Xu, Can Li, P. Faraboschi, J. Strachan, Deming Chen, K. Roy, Wen-mei W. Hwu, D. Milojicic

Citations: 30

Abstract

ReRAM-based accelerators have shown great potential for accelerating DNN inference because ReRAM crossbars can perform analog matrix-vector multiplication (MVM) operations with low latency and energy consumption. However, these crossbars require ADCs, which constitute a significant fraction of the cost of MVM operations. The overhead of ADCs can be mitigated via partial sum quantization. Prior quantization flows for DNN inference accelerators do not consider partial sum quantization, because it is not highly relevant to traditional digital architectures. To address this issue, we propose a mixed precision quantization scheme for ReRAM-based DNN inference accelerators in which weight quantization, input quantization, and partial sum quantization are jointly applied for each DNN layer. We also propose an automated quantization flow powered by deep reinforcement learning to search for the best quantization configuration in the large design space. Our evaluation shows that the proposed mixed precision quantization scheme and quantization flow reduce inference latency and energy consumption by up to 3.89× and 4.84×, respectively, while losing only 1.18% in DNN inference accuracy.
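
To make the idea of jointly quantizing weights, inputs, and partial sums in one layer concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`quantize_uniform`, `quantized_layer_mvm`), the crossbar height of 128 rows, the restriction to non-negative values in [0, 1], and the uniform quantizer used as a stand-in for an ADC are all assumptions made for illustration. The reinforcement-learning search over per-layer bit widths is not shown.

```python
import numpy as np


def quantize_uniform(x, bits, full_scale):
    """Uniformly quantize non-negative x onto 2**bits levels in [0, full_scale]."""
    levels = 2 ** bits - 1
    step = full_scale / levels
    return np.clip(np.round(x / step), 0, levels) * step


def quantized_layer_mvm(weights, inputs, w_bits, in_bits, psum_bits,
                        rows_per_crossbar=128):
    """Compute inputs @ weights with three quantizers applied in one layer.

    The weight matrix is split row-wise into crossbar-sized tiles
    (hypothetical tile height). Each tile yields a partial sum that is
    quantized by a simple ADC model before digital accumulation.
    """
    w_q = quantize_uniform(weights, w_bits, 1.0)   # weight quantization
    x_q = quantize_uniform(inputs, in_bits, 1.0)   # input quantization
    out = np.zeros(weights.shape[1])
    for start in range(0, weights.shape[0], rows_per_crossbar):
        w_tile = w_q[start:start + rows_per_crossbar]
        x_tile = x_q[start:start + rows_per_crossbar]
        partial = x_tile @ w_tile                  # analog MVM in one crossbar
        # Assumed ADC range: the largest sum a tile of this height can produce.
        full_scale = float(w_tile.shape[0])
        out += quantize_uniform(partial, psum_bits, full_scale)
    return out


rng = np.random.default_rng(0)
W = rng.random((256, 8))   # toy layer: 256 inputs, 8 outputs, values in [0, 1)
x = rng.random(256)
exact = x @ W

# Two hypothetical per-layer settings: (weight bits, input bits, partial-sum bits).
for name, (wb, ib, pb) in {"high": (8, 8, 12), "low": (4, 4, 4)}.items():
    approx = quantized_layer_mvm(W, x, wb, ib, pb)
    print(name, "max abs error:", np.max(np.abs(approx - exact)))
```

In this sketch the partial-sum bit width plays the role of ADC resolution, so lowering `psum_bits` is the knob that would reduce ADC cost at the expense of accuracy. A mixed precision flow of the kind described in the abstract would choose the three bit widths separately for each layer rather than fixing them globally.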
