DuAGNet: an unrestricted multimodal speech recognition framework using dual adaptive gating fusion
Authors: Jinghan Wu, Yakun Zhang, Meishan Zhang, Changyan Zheng, Xingyu Zhang, Liang Xie, Xingwei An, Erwei Yin
Journal: Applied Intelligence, 55(3), published 2024-12-26
Impact factor: 3.5; JCR: Q2 (Computer Science, Artificial Intelligence)
DOI: 10.1007/s10489-024-06119-0
URL: https://link.springer.com/article/10.1007/s10489-024-06119-0
Citation count: 0
Abstract
Speech recognition is a major communication channel for human-machine interaction and has seen outstanding breakthroughs. However, single-modal speech recognition is not practical enough for high-noise or silent communication applications. Integrating multiple modalities can effectively address this problem, but existing fusion methods tend to focus excessively on aligning semantic features and constructing fused features across modalities, neglecting the preservation of single-modal characteristics. In this work, audio signals, visual cues from lip-region images, and facial electromyography signals are used for unrestricted speech recognition, which can effectively resist the noise interference that affects any single modality. To preserve the unique feature expression of each speech modality and improve the global perception of the coupling correlations among them, a Dual Adaptive Gating fusion framework (dubbed DuAGNet) is proposed, which uses modality-specific and feature-specific adaptive gating networks. To validate the effectiveness of the proposed DuAGNet, a multimodal speech dataset is constructed from forty subjects, covering three modalities of speech data and 100 classes of Chinese phrases. On clean test data, DuAGNet obtains both the highest recognition accuracy (98.79%) and the lowest standard deviation (0.83). When severe audio noise is introduced, it achieves a maximum accuracy increase of over 80% compared with audio speech recognition systems.
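The abstract names two levels of gating (modality-specific and feature-specific) but does not give the network details. The following is a minimal, illustrative sketch of what such a two-level gate could look like. It is not the authors' architecture: the function `dual_gated_fusion`, the softmax modality gate, the per-dimension sigmoid gate, and all weights and dimensions are assumptions made purely for illustration, and the random numbers stand in for learned parameters and extracted features.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dual_gated_fusion(features, modality_weights, feature_weights):
    """Fuse per-modality feature vectors with two hypothetical gates.

    features:         dict name -> list[float], all of the same length
    modality_weights: dict name -> list[float], scoring vector (assumed)
    feature_weights:  dict name -> list[float], per-dimension gate slope (assumed)
    """
    names = sorted(features)
    # Modality-level gate: one scalar per modality, normalised with softmax.
    scores = [
        sum(w * x for w, x in zip(modality_weights[n], features[n]))
        for n in names
    ]
    alphas = dict(zip(names, softmax(scores)))

    dim = len(features[names[0]])
    fused = [0.0] * dim
    for n in names:
        for i, x in enumerate(features[n]):
            # Feature-level gate: a value in (0, 1) for each dimension.
            g = sigmoid(feature_weights[n][i] * x)
            fused[i] += alphas[n] * g * x
    return fused, alphas


# Toy inputs: random placeholders, not real audio, lip, or EMG features.
random.seed(0)
dim = 4
names = ["audio", "lip", "emg"]
features = {n: [random.uniform(-1, 1) for _ in range(dim)] for n in names}
modality_weights = {n: [random.uniform(-1, 1) for _ in range(dim)] for n in names}
feature_weights = {n: [random.uniform(-1, 1) for _ in range(dim)] for n in names}

fused, alphas = dual_gated_fusion(features, modality_weights, feature_weights)
print("modality gates:", alphas)
print("fused features:", fused)
```

In this sketch the modality gate decides how much each modality contributes overall, while the feature gate scales individual dimensions within a modality, so a noisy modality can be down-weighted without discarding its features entirely.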
About the journal:
With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions of real-life manufacturing, defense, management, government and industrial problems which are too complex to be solved through conventional approaches and require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments, addressing real and complex issues applicable to difficult problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.