Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets

Lu Zeng, S. Parthasarathi, Yuzong Liu, Alex Escott, S. Cheekatmalla, N. Strom, S. Vitaladevuni
Published: 2022-07-13, International Conference on Text, Speech and Dialogue
DOI: 10.48550/arXiv.2207.06920 (https://doi.org/10.48550/arXiv.2207.06920)
Citations: 4

Abstract

We propose a novel two-stage, sub-8-bit quantization-aware training algorithm for all components of a 250K-parameter feedforward, streaming, state-free keyword spotting model. In the first stage, we adapt a recently proposed quantization technique that applies a non-linear transformation with tanh(·) to the dense layer weights. In the second stage, we use linear quantization methods on the rest of the network, including the other parameters (bias, gain, batchnorm), inputs, and activations. We conduct large-scale experiments, training on 26,000 hours of de-identified production far-field and near-field audio data and evaluating on 4,000 hours of data. We organize our results in two embedded chipset settings: a) with the commodity ARM NEON instruction set and 8-bit containers, we present accuracy, CPU, and memory results using sub-8-bit weights (4, 5, and 8-bit) and 8-bit quantization of the rest of the network; b) with off-the-shelf neural network accelerators, we present accuracy results for a range of weight bit widths (1 and 5-bit) and project the reduction in memory utilization. In both configurations, our results show that the proposed algorithm can achieve: a) parity with a full floating-point model's operating point on a detection error tradeoff (DET) curve, in terms of false detection rate (FDR) at a given false rejection rate (FRR); b) a significant reduction in compute and memory, yielding up to a 3-fold improvement in CPU consumption and a more than 4-fold improvement in memory consumption.
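The two quantization stages named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption: the tanh-based weight quantizer follows the well-known DoReFa-Net formulation, the second-stage quantizer is a plain symmetric per-tensor scheme, and the function names (`tanh_quantize_weights`, `linear_quantize`, `packed_weight_bytes`) are hypothetical. It covers only the forward quantization step, not the training procedure.

```python
import math


def quantize_unit(r, bits):
    """Round r in [0, 1] to one of 2**bits evenly spaced levels in [0, 1]."""
    levels = 2 ** bits - 1
    return round(r * levels) / levels  # Python's round() rounds halves to even


def tanh_quantize_weights(weights, bits):
    """Stage-1 sketch: squash weights with tanh, then quantize uniformly.

    Assumed DoReFa-Net-style formulation, not confirmed by the abstract.
    Returns values in [-1, 1] drawn from at most 2**bits distinct levels.
    """
    squashed = [math.tanh(w) for w in weights]
    # Fall back to 1.0 to avoid dividing by zero for an all-zero tensor.
    peak = max(abs(t) for t in squashed) or 1.0
    # Map tanh(w) into [0, 1], quantize, then map back to [-1, 1].
    return [2.0 * quantize_unit(t / (2.0 * peak) + 0.5, bits) - 1.0
            for t in squashed]


def linear_quantize(values, bits=8):
    """Stage-2 sketch: symmetric linear quantization, one scale per tensor.

    Assumed scheme for biases, inputs, and activations. Returns the integer
    codes and the scale needed to map them back to real values.
    """
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values) or 1.0
    scale = peak / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def packed_weight_bytes(num_params, bits):
    """Bytes needed if every weight is stored tightly packed at `bits` bits."""
    return (num_params * bits + 7) // 8


if __name__ == "__main__":
    weights = [-2.0, -0.3, 0.0, 0.4, 1.5]
    print(tanh_quantize_weights(weights, bits=4))
    print(linear_quantize([-1.0, 0.5, 0.25], bits=8))
    print(packed_weight_bytes(250_000, 32), packed_weight_bytes(250_000, 5))
```

Under these assumptions, 250,000 weights occupy 1,000,000 bytes as 32-bit floats and 156,250 bytes when packed at 5 bits each. The packed figure applies only where hardware can store sub-8-bit values densely. With 8-bit containers, as in the ARM NEON setting, each weight still takes a full byte regardless of its nominal bit width.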