Investigation of attention mechanism for speech command recognition

IF 3 4区计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Multimedia Tools and Applications Pub Date : 2024-09-02 DOI:10.1007/s11042-024-20129-7

Jie Xie, Mingying Zhu, Kai Hu, Jinglan Zhang, Ya Guo

{"title":"Investigation of attention mechanism for speech command recognition","authors":"Jie Xie, Mingying Zhu, Kai Hu, Jinglan Zhang, Ya Guo","doi":"10.1007/s11042-024-20129-7","DOIUrl":null,"url":null,"abstract":"<p>As an application area of speech command recognition, the smart home has provided people with a convenient way to communicate with various digital devices. Deep learning has demonstrated its effectiveness in speech command recognition. However, few studies have conducted extensive research on leveraging attention mechanisms to enhance its performance. In this study, we aim to investigate the deep learning architectures for improved speaker-independent speech command recognition. Specifically, we first compare the log-Mel-spectrogram and log-Gammatone spectrogram using VGG style and VGG-skip style networks. Next, the best-performing model is selected and investigated using different attention mechanisms including channel-time attention, channel-frequency attention, and channel-time-frequency attention. Finally, a dual CNN with cross-attention is used for speech command classification. A self-made dataset including 40 participants with 12 classes is used for the experiment which are all recorded in Mandarin Chinese, utilizing a variety of smartphone devices across diverse settings. Experimental results indicate that using log-Gammatone spectrogram and VGG-skip style networks with cross attention can achieve the best performance, where the accuracy, precision, recall and F1-score are 94.59%, 95.84%, 94.64%, and 94.57%, respectively.</p>","PeriodicalId":18770,"journal":{"name":"Multimedia Tools and Applications","volume":"7 1","pages":""},"PeriodicalIF":3.0000,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Multimedia Tools and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11042-024-20129-7","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

As an application area of speech command recognition, the smart home has provided people with a convenient way to communicate with various digital devices. Deep learning has demonstrated its effectiveness in speech command recognition. However, few studies have conducted extensive research on leveraging attention mechanisms to enhance its performance. In this study, we aim to investigate the deep learning architectures for improved speaker-independent speech command recognition. Specifically, we first compare the log-Mel-spectrogram and log-Gammatone spectrogram using VGG style and VGG-skip style networks. Next, the best-performing model is selected and investigated using different attention mechanisms including channel-time attention, channel-frequency attention, and channel-time-frequency attention. Finally, a dual CNN with cross-attention is used for speech command classification. A self-made dataset including 40 participants with 12 classes is used for the experiment which are all recorded in Mandarin Chinese, utilizing a variety of smartphone devices across diverse settings. Experimental results indicate that using log-Gammatone spectrogram and VGG-skip style networks with cross attention can achieve the best performance, where the accuracy, precision, recall and F1-score are 94.59%, 95.84%, 94.64%, and 94.57%, respectively.

Abstract Image

查看原文本刊更多论文

语音命令识别的注意机制研究

作为语音命令识别的一个应用领域，智能家居为人们提供了与各种数字设备交流的便捷方式。深度学习在语音命令识别中的有效性已得到证实。然而，很少有研究对利用注意力机制来提高其性能进行广泛研究。在本研究中，我们旨在研究深度学习架构，以提高与说话人无关的语音命令识别能力。具体来说，我们首先使用 VGG 风格和 VGG-skip 风格网络对 log-Mel 频谱图和 log-Gammatone 频谱图进行比较。然后，选出表现最佳的模型，并使用不同的注意机制进行研究，包括信道-时间注意、信道-频率注意和信道-时间-频率注意。最后，使用具有交叉注意力的双 CNN 进行语音命令分类。实验使用了一个自制的数据集，其中包括 40 名参与者和 12 个类别，这些数据都是用普通话录制的，在不同的环境下使用了各种智能手机设备。实验结果表明，使用具有交叉注意力的对数-伽马通频谱图和 VGG-skip 风格网络可以获得最佳性能，准确率、精确度、召回率和 F1 分数分别为 94.59%、95.84%、94.64% 和 94.57%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Multimedia Tools and Applications 工程技术-工程：电子与电气

CiteScore

7.20

自引率

16.70%

发文量

2439

审稿时长

9.2 months

期刊介绍： Multimedia Tools and Applications publishes original research articles on multimedia development and system support tools as well as case studies of multimedia applications. It also features experimental and survey articles. The journal is intended for academics, practitioners, scientists and engineers who are involved in multimedia system research, design and applications. All papers are peer reviewed. Specific areas of interest include: - Multimedia Tools: - Multimedia Applications: - Prototype multimedia systems and platforms