Keyword Spotting using Time-Domain Features in a Temporal Convolutional Network

Emad A. Ibrahim, J. Huisken, H. Fatemi, J. P. D. Gyvez
{"title":"Keyword Spotting using Time-Domain Features in a Temporal Convolutional Network","authors":"Emad A. Ibrahim, J. Huisken, H. Fatemi, J. P. D. Gyvez","doi":"10.1109/DSD.2019.00053","DOIUrl":null,"url":null,"abstract":"With the increasing demand on voice recognition services, more attention is paid to simpler algorithms that are capable to run locally on a hardware device. This paper demonstrates simpler speech features derived in the time-domain for Keyword Spotting (KWS). The features are considered as constrained lag autocorrelations computed on overlapped speech frames to form a 2D map. We refer to this as Multi-Frame Shifted Time Similarity (MFSTS). MFSTS performance is compared against the widely known Mel-Frequency Cepstral Coefficients (MFCC) that are computed in the frequency-domain. A Temporal Convolutional Network (TCN) is designed to classify keywords using both MFCC and MFSTS. This is done by employing an open source dataset from Google Brain, containing ~ 106000 files of one-second recorded words such as, 'Backward', 'Forward', 'Stop' etc. Initial findings show that MFSTS can be used for KWS tasks without visiting the frequency-domain. Our experimental results show that classification of the whole dataset (25 classes) based on MFCC and MFSTS are in a very good agreement. We compare the performance of the TCNbased classifier with other related work in the literature. The classification is performed using small memory footprint (~ 90 KB) and low compute power (~ 5 MOPs) per inference. The achieved classification accuracies are 93.4% using MFCC and 91.2% using MFSTS. Furthermore, a case study is provided for a single-keyword spotting task. The case study demonstrates how MFSTS can be used as a simple preprocessing scheme with small classifiers while achieving as high as 98% accuracy. The compute simplicity of MFSTS makes it attractive for low power KWS applications paving the way for resource-aware solutions.","PeriodicalId":217233,"journal":{"name":"2019 22nd Euromicro Conference on Digital System Design (DSD)","volume":"12 6","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 22nd Euromicro Conference on Digital System Design (DSD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSD.2019.00053","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

With the increasing demand for voice recognition services, more attention is being paid to simpler algorithms that can run locally on a hardware device. This paper demonstrates simpler speech features derived in the time domain for Keyword Spotting (KWS). The features are constrained-lag autocorrelations computed on overlapped speech frames to form a 2D map, which we refer to as Multi-Frame Shifted Time Similarity (MFSTS). MFSTS performance is compared against the widely known Mel-Frequency Cepstral Coefficients (MFCC), which are computed in the frequency domain. A Temporal Convolutional Network (TCN) is designed to classify keywords using both MFCC and MFSTS, employing an open-source dataset from Google Brain containing ~106,000 files of one-second recorded words such as 'Backward', 'Forward', and 'Stop'. Initial findings show that MFSTS can be used for KWS tasks without visiting the frequency domain. Our experimental results show that classifications of the whole dataset (25 classes) based on MFCC and MFSTS are in very good agreement. We compare the performance of the TCN-based classifier with other related work in the literature. The classification is performed using a small memory footprint (~90 KB) and low compute power (~5 MOPs) per inference. The achieved classification accuracies are 93.4% using MFCC and 91.2% using MFSTS. Furthermore, a case study is provided for a single-keyword spotting task, demonstrating how MFSTS can be used as a simple preprocessing scheme with small classifiers while achieving accuracy as high as 98%. The computational simplicity of MFSTS makes it attractive for low-power KWS applications, paving the way for resource-aware solutions.
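To make the feature pipeline concrete, the following minimal Python sketch computes constrained-lag autocorrelations over overlapping frames and stacks them into a 2D map, mirroring the MFSTS idea described above. The frame length (400 samples, i.e., 25 ms at 16 kHz), hop size (160 samples, 10 ms), lag count (40), and energy normalization are illustrative assumptions, not the paper's exact settings.

import numpy as np

def mfsts(signal, frame_len=400, hop=160, max_lag=40):
    """Stack constrained-lag autocorrelations of overlapping frames into a 2D map."""
    num_frames = 1 + (len(signal) - frame_len) // hop
    feats = np.zeros((num_frames, max_lag), dtype=np.float32)
    for i in range(num_frames):
        frame = signal[i * hop : i * hop + frame_len]
        for lag in range(max_lag):
            # Similarity between the frame and a lag-shifted copy of itself.
            feats[i, lag] = np.dot(frame[:frame_len - lag], frame[lag:])
        # Normalize by the lag-0 energy so every row starts at 1 (an assumed choice).
        feats[i] /= feats[i, 0] + 1e-12
    return feats

# Example: a one-second 16 kHz waveform yields a (98, 40) feature map.
x = np.random.randn(16000).astype(np.float32)
print(mfsts(x).shape)

Under these assumed settings, each one-second utterance produces a 98 x 40 time-lag map that could be fed to the TCN classifier in place of an MFCC map, using only multiply-accumulate operations and no frequency-domain transform.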