Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets
Lu Zeng, S. Parthasarathi, Yuzong Liu, Alex Escott, S. Cheekatmalla, N. Strom, S. Vitaladevuni
International Conference on Text, Speech and Dialogue, 2022-07-13. DOI: 10.48550/arXiv.2207.06920
Citations: 4
Abstract
We propose a novel two-stage sub 8-bit quantization-aware training algorithm for all components of a 250K-parameter feedforward, streaming, state-free keyword spotting model. In the first stage, we adapt a recently proposed quantization technique that applies a non-linear tanh(·) transformation to dense-layer weights. In the second stage, we apply linear quantization to the rest of the network, including the remaining parameters (bias, gain, batch norm), inputs, and activations. We conduct large-scale experiments, training on 26,000 hours of de-identified production far-field and near-field audio data and evaluating on 4,000 hours of data. We organize our results in two embedded chipset settings: a) with a commodity ARM NEON instruction set and 8-bit containers, we present accuracy, CPU, and memory results using sub 8-bit weights (4, 5, and 8-bit) and 8-bit quantization of the rest of the network; b) with off-the-shelf neural network accelerators, for a range of weight bit widths (1 and 5-bit), we present accuracy results and project the reduction in memory utilization. In both configurations, our results show that the proposed algorithm can achieve: a) parity with a full floating-point model's operating point on a detection error tradeoff (DET) curve, in terms of false detection rate (FDR) at a given false rejection rate (FRR); b) significant reduction in compute and memory, yielding up to 3x improvement in CPU consumption and more than 4x improvement in memory consumption.
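To make the two quantization stages concrete, below is a minimal NumPy sketch: stage 1 squashes dense-layer weights with tanh(·) before uniform quantization, and stage 2 applies plain linear quantization to everything else. The function names, bit widths, clipping bounds, and the arctanh dequantization step are illustrative assumptions for a fake-quantized forward pass, not the authors' exact training recipe.

```python
import numpy as np

def tanh_quantize_weights(w, num_bits=4):
    """Stage 1 (sketch): tanh(.) transform on dense-layer weights, then
    uniform quantization of the squashed values. Illustrative only."""
    levels = 2 ** num_bits - 1
    t = np.tanh(w)                        # squash weights into (-1, 1)
    t = (t + 1.0) / 2.0                   # map to [0, 1]
    q = np.round(t * levels) / levels     # quantize to 2^b - 1 uniform levels
    # Map back to the original weight space for a fake-quantized forward pass.
    return np.arctanh(np.clip(2.0 * q - 1.0, -0.999999, 0.999999))

def linear_quantize(x, num_bits=8):
    """Stage 2 (sketch): symmetric linear quantization for biases, gains,
    batch-norm parameters, inputs, and activations."""
    levels = 2 ** (num_bits - 1) - 1
    max_abs = np.max(np.abs(x))
    scale = max_abs / levels if max_abs > 0 else 1.0
    return np.round(x / scale) * scale

# Toy usage: fake-quantized forward pass through one dense layer
# (4-bit weights, 8-bit input), mirroring the ARM NEON setting above.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.5, size=(64, 32))
x = rng.normal(size=(1, 64))
y = linear_quantize(x, 8) @ tanh_quantize_weights(w, 4)
```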