CM-CIF: Cross-Modal for Unaligned Modality Fusion with Continuous Integrate-and-Fire
Zheng Jiang, Yang Xu, Yanyan Xu, Dengfeng Ke, Kaile Su
2022 7th International Conference on Computer and Communication Systems (ICCCS), 2022-04-22
DOI: 10.1109/icccs55155.2022.9846612
Abstract
Audio-Visual Speech Recognition aims to identify the content of a spoken sentence by extracting lip-movement features and acoustic features from an input video of a person speaking. Although current audio-visual fusion models solve, to a certain extent, the problem that different modalities have inconsistent time lengths, fusing the modalities may cause acoustic boundary ambiguity. To better address this problem, we propose a model named Cross-Modal Continuous Integrate-and-Fire (CM-CIF). The model integrates cross-modal information into the accumulated weights so that acoustic boundaries can be located more accurately. We use the Transformer-seq2seq model as the baseline and evaluate CM-CIF on the public datasets LRS2 and LRS3. Experimental results show that CM-CIF achieves competitive performance.
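
To make the integrate-and-fire idea concrete, below is a minimal Python sketch of a standard CIF-style firing loop: per-frame weights are accumulated, and whenever the accumulated weight crosses a threshold, one label-level embedding is fired, with the remainder carried over to the next unit. In CM-CIF the per-frame weights are assumed to be informed by cross-modal (lip-movement) information; the exact weight predictor and fusion operator are not specified in this abstract, so the names (cm_cif_fire, audio_feats, frame_weights, threshold) are purely illustrative.

# Hypothetical sketch of a CIF-style firing loop; not the paper's implementation.
import numpy as np

def cm_cif_fire(audio_feats, frame_weights, threshold=1.0):
    """Accumulate per-frame features until the accumulated weight reaches
    `threshold`, then fire one integrated (label-level) embedding.
    audio_feats:   (T, D) frame-level acoustic features
    frame_weights: (T,)   per-frame weights, assumed cross-modally informed in CM-CIF
    """
    fired = []
    acc_w = 0.0
    acc_h = np.zeros(audio_feats.shape[1])
    for h_t, a_t in zip(audio_feats, frame_weights):
        if acc_w + a_t < threshold:
            # keep integrating: no boundary inside this frame
            acc_w += a_t
            acc_h += a_t * h_t
        else:
            # boundary falls inside this frame: split its weight, fire one unit,
            # and start the next unit with the leftover portion
            used = threshold - acc_w
            fired.append(acc_h + used * h_t)
            acc_w = a_t - used
            acc_h = acc_w * h_t
    return np.stack(fired) if fired else np.empty((0, audio_feats.shape[1]))

# Toy usage with random frame features and weights
T, D = 20, 8
feats = np.random.randn(T, D).astype(np.float32)
weights = np.random.uniform(0.1, 0.5, size=T)
units = cm_cif_fire(feats, weights)
print(units.shape)  # (number_of_fired_units, D)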