Multi-Scale Hybrid Fusion Network for Mandarin Audio-Visual Speech Recognition

2023 IEEE International Conference on Multimedia and Expo (ICME). Pub Date: 2023-07-01. DOI: 10.1109/ICME55011.2023.00116

Jinxin Wang, Zhongwen Guo, Chao Yang, Xiaomei Li, Ziyuan Cui

Abstract

Compared with feature fusion or decision fusion, hybrid fusion can improve audio-visual speech recognition accuracy. Existing work focuses mainly on designing the multi-modality feature extraction, interaction, and prediction processes, while neglecting useful multi-modal information and the optimal combination of the different prediction results. In this paper, we propose a multi-scale hybrid fusion network (MSHF) for Mandarin audio-visual speech recognition. MSHF consists of two subnetworks: a feature extraction subnetwork, which uses the proposed multi-scale feature extraction module (MSFE) to obtain multi-scale features, and a hybrid fusion subnetwork, which integrates the intrinsic correlation between the modalities and optimizes the weights of the per-modality prediction results to achieve the best classification. We further design a feature recognition module (FRM) for accurate audio-visual speech recognition. In experiments on the CAS-VSR-W1k dataset, the proposed method outperforms the selected competitive baselines and the state of the art, indicating the superiority of the proposed modules.
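The abstract does not specify how the hybrid fusion subnetwork weights the prediction results, so the following is only a generic sketch of the idea of combining several prediction branches with normalized weights. It is not the authors' MSHF implementation. The function name `hybrid_fuse`, the three-branch layout (audio, visual, fused-feature), and the softmax-normalized `branch_scores` are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_fuse(audio_logits, visual_logits, fused_logits, branch_scores):
    """Combine three prediction branches into one class distribution.

    audio_logits / visual_logits: per-class scores from single-modality
        classifiers (the decision-fusion part; hypothetical inputs).
    fused_logits: per-class scores from a classifier over fused
        audio-visual features (the feature-fusion part; hypothetical input).
    branch_scores: three unnormalized scores, one per branch. In a trained
        model these would be learned; here they are plain numbers.
    """
    # Normalize branch scores so the weights are non-negative and sum to 1.
    weights = softmax(branch_scores)
    # Turn each branch's logits into a probability distribution.
    branches = [softmax(audio_logits), softmax(visual_logits), softmax(fused_logits)]
    num_classes = len(audio_logits)
    # Weighted sum of the branch distributions, class by class.
    return [
        sum(w * p[c] for w, p in zip(weights, branches))
        for c in range(num_classes)
    ]


if __name__ == "__main__":
    probs = hybrid_fuse(
        audio_logits=[2.0, 0.5, 0.1],
        visual_logits=[0.2, 1.5, 0.3],
        fused_logits=[1.0, 1.2, 0.1],
        branch_scores=[0.0, 0.0, 0.0],  # equal weights for all branches
    )
    print(probs)
```

Because the weights are softmax-normalized, the output remains a valid probability distribution, and raising one branch's score shifts the final prediction toward that branch.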
