Investigation of attention mechanism for speech command recognition

Impact factor: 3.0 | Zone 4: Computer Science | JCR Q2: Computer Science, Information Systems
Jie Xie, Mingying Zhu, Kai Hu, Jinglan Zhang, Ya Guo
Journal: Multimedia Tools and Applications
Publication date: 2024-09-02
Publication type: Journal Article
Open access: no
DOI: 10.1007/s11042-024-20129-7 (https://doi.org/10.1007/s11042-024-20129-7)
Citations: 0

Abstract

Smart homes are a major application area for speech command recognition, giving people a convenient way to communicate with digital devices. Deep learning has proven effective for speech command recognition, but few studies have examined in depth how attention mechanisms can improve its performance. In this study, we investigate deep learning architectures for improved speaker-independent speech command recognition. Specifically, we first compare log-Mel spectrograms and log-Gammatone spectrograms using VGG-style and VGG-skip-style networks. Next, we select the best-performing model and evaluate it with different attention mechanisms: channel-time attention, channel-frequency attention, and channel-time-frequency attention. Finally, a dual CNN with cross-attention is used for speech command classification. The experiments use a self-collected dataset of 12 command classes from 40 participants, all recorded in Mandarin Chinese on a variety of smartphones in diverse settings. The results indicate that the log-Gammatone spectrogram combined with VGG-skip-style networks and cross-attention achieves the best performance, with accuracy, precision, recall, and F1-score of 94.59%, 95.84%, 94.64%, and 94.57%, respectively.
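
The abstract reports four scores for a 12-class problem but does not say how precision, recall, and F1 were averaged across classes. As a minimal sketch, assuming macro averaging (an assumption, not something the abstract states), the four numbers could be computed as follows. The function name `macro_metrics` and the toy labels are hypothetical and are not the paper's data.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1.

    Macro averaging is an assumption here: each class contributes
    equally, regardless of how many samples it has.
    """
    labels = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # samples that really belong to each class
    pred_count = Counter(y_pred)   # samples predicted as each class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = correct[c] / pred_count[c] if pred_count[c] else 0.0
        rec = correct[c] / true_count[c] if true_count[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(correct.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with three made-up command classes (two samples each).
y_true = ["on", "on", "off", "off", "stop", "stop"]
y_pred = ["on", "off", "off", "off", "stop", "stop"]
scores = macro_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

In this toy case a single "on" sample is misclassified as "off", so macro precision ends up higher than accuracy while macro recall equals accuracy, because the classes are balanced. This shows only that a precision above accuracy, as in the reported results, is arithmetically possible; it says nothing about the paper's actual evaluation protocol.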

Source journal
Multimedia Tools and Applications (Engineering and Technology: Engineering, Electrical and Electronic)
CiteScore: 7.20
Self-citation rate: 16.70%
Articles published: 2439
Review time: 9.2 months
About the journal: Multimedia Tools and Applications publishes original research articles on multimedia development and system support tools, as well as case studies of multimedia applications. It also features experimental and survey articles. The journal is intended for academics, practitioners, scientists, and engineers involved in multimedia system research, design, and applications. All papers are peer reviewed. Specific areas of interest include multimedia tools, multimedia applications, and prototype multimedia systems and platforms.