Endangered Tujia language Speech Recognition Research based on Audio-Visual Fusion
Chongchong Yu, Jiaqi Yu, Zhaopeng Qian, Yuchen Tan
Proceedings of the 2022 5th Artificial Intelligence and Cloud Computing Conference, 2022-12-17. DOI: https://doi.org/10.1145/3582099.3582128
Abstract
As an endangered language, Tujia is a non-renewable intangible cultural resource. Automatic speech recognition (ASR) uses artificial intelligence technology to assist with the manual labeling of Tujia speech, making it an effective means of protecting the language. However, Tujia has few native speakers, little labeled corpus data, and considerable noise in the available recordings. Acoustic models therefore suffer from overfitting and low noise immunity, which seriously harms recognition accuracy. To address these deficiencies, an audio-visual speech recognition (AVSR) approach based on Transformer-CTC is proposed, which reduces the acoustic model's dependence on clean audio and on data quantity by introducing visual modality information, including lip movements. Specifically, the new approach enriches the representation of the speakers' feature space by fusing audio and visual information, thereby alleviating the limited information available from a single modality. Experimental results show that the optimal character error rate (CER) of the AVSR model is 8.2% lower than that of traditional audio-only models and 11.8% lower than that of lip reading alone. The proposed AVSR approach addresses the low accuracy of recognizing endangered languages and is therefore of great significance for the protection and preservation of endangered languages through AI.
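For readers unfamiliar with the Transformer-CTC audio-visual setup the abstract describes, the following is a minimal sketch, not the authors' published implementation: the paper does not disclose its architecture, so the PyTorch framing, layer sizes, feature dimensions, and concatenation-based fusion below are all illustrative assumptions.

```python
# Hypothetical sketch of an audio-visual Transformer-CTC model in PyTorch.
# All dimensions and the frame-level concatenation fusion are assumptions,
# not the architecture from the paper.
import torch
import torch.nn as nn

class AVSRTransformerCTC(nn.Module):
    def __init__(self, audio_dim=80, visual_dim=512, d_model=256,
                 nhead=4, num_layers=6, vocab_size=100):
        super().__init__()
        # Project each modality into a shared model dimension.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Fuse the two streams frame by frame via concatenation,
        # then map back to d_model.
        self.fusion = nn.Linear(2 * d_model, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # Output head over the vocabulary plus the CTC blank symbol.
        self.classifier = nn.Linear(d_model, vocab_size + 1)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, T, audio_dim), e.g. log-Mel filterbanks
        # visual_feats: (batch, T, visual_dim), e.g. lip-region embeddings,
        # assumed already aligned to the audio frame rate
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        fused = self.fusion(torch.cat([a, v], dim=-1))
        enc = self.encoder(fused)
        # Per-frame log-probabilities; transpose to (T, batch, classes)
        # before passing to nn.CTCLoss.
        return self.classifier(enc).log_softmax(dim=-1)

# Training would pair this with CTC loss, e.g.:
# ctc = nn.CTCLoss(blank=100, zero_infinity=True)
```

The intuition behind fusing before the encoder is that visual features such as lip movements are unaffected by acoustic noise, so the shared Transformer can lean on them when the audio stream is unreliable, which matches the noise-robustness motivation in the abstract.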