Keyword Spotting using Time-Domain Features in a Temporal Convolutional Network

Emad A. Ibrahim, J. Huisken, H. Fatemi, J. P. D. Gyvez
{"title":"Keyword Spotting using Time-Domain Features in a Temporal Convolutional Network","authors":"Emad A. Ibrahim, J. Huisken, H. Fatemi, J. P. D. Gyvez","doi":"10.1109/DSD.2019.00053","DOIUrl":null,"url":null,"abstract":"With the increasing demand on voice recognition services, more attention is paid to simpler algorithms that are capable to run locally on a hardware device. This paper demonstrates simpler speech features derived in the time-domain for Keyword Spotting (KWS). The features are considered as constrained lag autocorrelations computed on overlapped speech frames to form a 2D map. We refer to this as Multi-Frame Shifted Time Similarity (MFSTS). MFSTS performance is compared against the widely known Mel-Frequency Cepstral Coefficients (MFCC) that are computed in the frequency-domain. A Temporal Convolutional Network (TCN) is designed to classify keywords using both MFCC and MFSTS. This is done by employing an open source dataset from Google Brain, containing ~ 106000 files of one-second recorded words such as, 'Backward', 'Forward', 'Stop' etc. Initial findings show that MFSTS can be used for KWS tasks without visiting the frequency-domain. Our experimental results show that classification of the whole dataset (25 classes) based on MFCC and MFSTS are in a very good agreement. We compare the performance of the TCNbased classifier with other related work in the literature. The classification is performed using small memory footprint (~ 90 KB) and low compute power (~ 5 MOPs) per inference. The achieved classification accuracies are 93.4% using MFCC and 91.2% using MFSTS. Furthermore, a case study is provided for a single-keyword spotting task. The case study demonstrates how MFSTS can be used as a simple preprocessing scheme with small classifiers while achieving as high as 98% accuracy. The compute simplicity of MFSTS makes it attractive for low power KWS applications paving the way for resource-aware solutions.","PeriodicalId":217233,"journal":{"name":"2019 22nd Euromicro Conference on Digital System Design (DSD)","volume":"12 6","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 22nd Euromicro Conference on Digital System Design (DSD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSD.2019.00053","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

With the increasing demand for voice recognition services, more attention is being paid to simpler algorithms that can run locally on a hardware device. This paper demonstrates simpler speech features derived in the time domain for Keyword Spotting (KWS). The features are constrained-lag autocorrelations computed on overlapped speech frames to form a 2D map, which we refer to as Multi-Frame Shifted Time Similarity (MFSTS). MFSTS performance is compared against the widely known Mel-Frequency Cepstral Coefficients (MFCC), which are computed in the frequency domain. A Temporal Convolutional Network (TCN) is designed to classify keywords using both MFCC and MFSTS, employing an open-source dataset from Google Brain containing ~106,000 files of one-second recorded words such as 'Backward', 'Forward', and 'Stop'. Initial findings show that MFSTS can be used for KWS tasks without visiting the frequency domain. Our experimental results show that classifications of the whole dataset (25 classes) based on MFCC and MFSTS are in very good agreement. We compare the performance of the TCN-based classifier with other related work in the literature. The classification is performed using a small memory footprint (~90 KB) and low compute power (~5 MOPs) per inference. The achieved classification accuracies are 93.4% using MFCC and 91.2% using MFSTS. Furthermore, a case study is provided for a single-keyword spotting task, demonstrating how MFSTS can be used as a simple preprocessing scheme with small classifiers while achieving accuracy as high as 98%. The computational simplicity of MFSTS makes it attractive for low-power KWS applications, paving the way for resource-aware solutions.
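To make the feature pipeline concrete, the following minimal Python sketch computes constrained-lag autocorrelations over overlapping frames and stacks them into a 2D map, mirroring the MFSTS idea described above. The frame length (400 samples, i.e., 25 ms at 16 kHz), hop size (160 samples, 10 ms), lag count (40), and energy normalization are illustrative assumptions, not the paper's exact settings.

import numpy as np

def mfsts(signal, frame_len=400, hop=160, max_lag=40):
    """Stack constrained-lag autocorrelations of overlapping frames into a 2D map."""
    num_frames = 1 + (len(signal) - frame_len) // hop
    feats = np.zeros((num_frames, max_lag), dtype=np.float32)
    for i in range(num_frames):
        frame = signal[i * hop : i * hop + frame_len]
        for lag in range(max_lag):
            # Similarity between the frame and a lag-shifted copy of itself.
            feats[i, lag] = np.dot(frame[:frame_len - lag], frame[lag:])
        # Normalize by the lag-0 energy so every row starts at 1 (an assumed choice).
        feats[i] /= feats[i, 0] + 1e-12
    return feats

# Example: a one-second 16 kHz waveform yields a (98, 40) feature map.
x = np.random.randn(16000).astype(np.float32)
print(mfsts(x).shape)

Under these assumed settings, each one-second utterance produces a 98 x 40 time-lag map that could be fed to the TCN classifier in place of an MFCC map, using only multiply-accumulate operations and no frequency-domain transform.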