Endangered Tujia language Speech Recognition Research based on Audio-Visual Fusion
Chongchong Yu, Jiaqi Yu, Zhaopeng Qian, Yuchen Tan
Proceedings of the 2022 5th Artificial Intelligence and Cloud Computing Conference, 2022-12-17. DOI: https://doi.org/10.1145/3582099.3582128
Abstract
As an endangered language, Tujia is a non-renewable intangible cultural resource. Automatic speech recognition (ASR) uses artificial intelligence technology to assist with the manual labeling of Tujia speech, making it an effective means of protecting the language. However, Tujia has few native speakers, little labeled corpus data, and considerable noise in the available recordings. Acoustic models therefore suffer from overfitting and low noise immunity, which seriously harms recognition accuracy. To address these deficiencies, an audio-visual speech recognition (AVSR) approach based on Transformer-CTC is proposed, which reduces the acoustic model's dependence on clean audio and on data quantity by introducing visual modality information, including lip movements. Specifically, the new approach enriches the representation of the speakers' feature space by fusing audio and visual information, thereby alleviating the limited information available from a single modality. Experimental results show that the optimal character error rate (CER) of the AVSR model is 8.2% lower than that of traditional audio-only models and 11.8% lower than that of lip reading alone. The proposed AVSR approach addresses the low accuracy of recognizing endangered languages and is therefore of great significance for the protection and preservation of endangered languages through AI.
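For readers unfamiliar with the Transformer-CTC audio-visual setup the abstract describes, the following is a minimal sketch, not the authors' published implementation: the paper does not disclose its architecture, so the PyTorch framing, layer sizes, feature dimensions, and concatenation-based fusion below are all illustrative assumptions.

```python
# Hypothetical sketch of an audio-visual Transformer-CTC model in PyTorch.
# All dimensions and the frame-level concatenation fusion are assumptions,
# not the architecture from the paper.
import torch
import torch.nn as nn

class AVSRTransformerCTC(nn.Module):
    def __init__(self, audio_dim=80, visual_dim=512, d_model=256,
                 nhead=4, num_layers=6, vocab_size=100):
        super().__init__()
        # Project each modality into a shared model dimension.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Fuse the two streams frame by frame via concatenation,
        # then map back to d_model.
        self.fusion = nn.Linear(2 * d_model, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # Output head over the vocabulary plus the CTC blank symbol.
        self.classifier = nn.Linear(d_model, vocab_size + 1)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, T, audio_dim), e.g. log-Mel filterbanks
        # visual_feats: (batch, T, visual_dim), e.g. lip-region embeddings,
        # assumed already aligned to the audio frame rate
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        fused = self.fusion(torch.cat([a, v], dim=-1))
        enc = self.encoder(fused)
        # Per-frame log-probabilities; transpose to (T, batch, classes)
        # before passing to nn.CTCLoss.
        return self.classifier(enc).log_softmax(dim=-1)

# Training would pair this with CTC loss, e.g.:
# ctc = nn.CTCLoss(blank=100, zero_infinity=True)
```

The intuition behind fusing before the encoder is that visual features such as lip movements are unaffected by acoustic noise, so the shared Transformer can lean on them when the audio stream is unreliable, which matches the noise-robustness motivation in the abstract.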