DuAGNet: an unrestricted multimodal speech recognition framework using dual adaptive gating fusion

IF 3.5 2区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Applied Intelligence Pub Date : 2024-12-26 DOI:10.1007/s10489-024-06119-0

Jinghan Wu, Yakun Zhang, Meishan Zhang, Changyan Zheng, Xingyu Zhang, Liang Xie, Xingwei An, Erwei Yin

{"title":"DuAGNet: an unrestricted multimodal speech recognition framework using dual adaptive gating fusion","authors":"Jinghan Wu, Yakun Zhang, Meishan Zhang, Changyan Zheng, Xingyu Zhang, Liang Xie, Xingwei An, Erwei Yin","doi":"10.1007/s10489-024-06119-0","DOIUrl":null,"url":null,"abstract":"<div><p>Speech recognition is a major communication channel for human-machine interaction with outstanding breakthroughs. However, the practicality of single-modal speech recognition is not satisfactory in high-noise or silent communication applications. Integrating multiple modalities can effectively address this problem, but existing fusion methods tend to pay excessive attention to the alignment of semantic features and the construction of fused features between modalities, omitting the preservation of single-modal characteristics. In this work, audio signals, visual clues of lip region images, and facial electromyography signals are used for unrestricted speech recognition, which can effectively resist the noise interference brought by single modalities. To preserve the unique feature expression of each speech modality and improve the global perception of the coupling correlations among them, a Dual Adaptive Gating fusion framework is proposed (dubbed DuAGNet), utilizing modality-specific and feature-specific adaptive gating networks. A multimodal speech dataset is constructed from forty subjects to validate the effectiveness of the proposed DuAGNet, covering three modalities of speech data and 100 classes of Chinese phrases. Both the highest recognition accuracy of 98.79% and lowest standard deviation of 0.83 are obtained with clean test data, and a maximum increase of accuracy over 80% is achieved, compared to audio speech recognition systems when introduced severe audio noise.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 3","pages":""},"PeriodicalIF":3.5000,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Intelligence","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10489-024-06119-0","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Speech recognition is a major communication channel for human-machine interaction with outstanding breakthroughs. However, the practicality of single-modal speech recognition is not satisfactory in high-noise or silent communication applications. Integrating multiple modalities can effectively address this problem, but existing fusion methods tend to pay excessive attention to the alignment of semantic features and the construction of fused features between modalities, omitting the preservation of single-modal characteristics. In this work, audio signals, visual clues of lip region images, and facial electromyography signals are used for unrestricted speech recognition, which can effectively resist the noise interference brought by single modalities. To preserve the unique feature expression of each speech modality and improve the global perception of the coupling correlations among them, a Dual Adaptive Gating fusion framework is proposed (dubbed DuAGNet), utilizing modality-specific and feature-specific adaptive gating networks. A multimodal speech dataset is constructed from forty subjects to validate the effectiveness of the proposed DuAGNet, covering three modalities of speech data and 100 classes of Chinese phrases. Both the highest recognition accuracy of 98.79% and lowest standard deviation of 0.83 are obtained with clean test data, and a maximum increase of accuracy over 80% is achieved, compared to audio speech recognition systems when introduced severe audio noise.

Abstract Image

查看原文本刊更多论文

DuAGNet：使用双自适应门融合的无限制多模态语音识别框架

语音识别是人机交互的主要沟通渠道，具有突出的突破。然而，单模态语音识别在高噪声或无声通信应用中的实用性并不令人满意。多模态集成可以有效地解决这一问题，但现有的融合方法往往过于注重语义特征的对齐和模态间融合特征的构建，忽略了对单模态特征的保留。本文利用音频信号、唇区图像的视觉线索和面部肌电信号进行无限制语音识别，可以有效抵抗单一模态带来的噪声干扰。为了保持每种语音模态的独特特征表达并提高对它们之间耦合相关性的全局感知，提出了一种双自适应门融合框架（称为DuAGNet），该框架利用了特定模态和特定特征的自适应门网络。为了验证多模态语音数据集的有效性，我们从40个主题中构建了一个多模态语音数据集，该数据集涵盖了语音数据的3种模态和100类汉语短语。在测试数据干净的情况下，与引入严重音频噪声的语音识别系统相比，识别准确率最高可达98.79%，标准差最低可达0.83，准确率最高可达80%以上。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Applied Intelligence 工程技术-计算机：人工智能

CiteScore

6.60

自引率

20.80%

发文量

1361

审稿时长

5.9 months

期刊介绍： With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance. The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.