Endangered Tujia language Speech Recognition Research based on Audio-Visual Fusion

Chongchong Yu, Jiaqi Yu, Zhaopeng Qian, Yuchen Tan
{"title":"Endangered Tujia language Speech Recognition Research based on Audio-Visual Fusion","authors":"Chongchong Yu, Jiaqi Yu, Zhaopeng Qian, Yuchen Tan","doi":"10.1145/3582099.3582128","DOIUrl":null,"url":null,"abstract":"As an endangered language, Tujia language is a non-renewable intangible cultural resource. Automatic speech recognition (ASR) uses artificial intelligence technology to facilitate manually label Tujia language, which is an effective means to protect this language. However, due to the fact that Tujia language has few native speakers, few labeled corpus, and much noise in the corpus. The acoustic models thus suffer from over fitting and lowe noise immunity, which seriously harms the accuracy of recognition. To tackle the deficiencies, an approach of audio-visual speech recognition (AVSR) based on Transformer-CTC is proposed, which reduces the dependence of acoustic models on noise and the quantity of data by introducing visual modality in-formation including lip movements. Specifically, the new approach enhances the expression of speakers’ feature space through the fusion of audio and visual information, thus solving the problem of less available information for single modality. Experiment results show that the optimal CER of AVSR is 8.2% lower than that of traditional models, and 11.8% lower than that for lip reading. The proposed AVSR tackles the issue of low accuracy in recognizing endangered languages. Therefore, AVSR is of great significance in studying the protection and preservation of endangered languages through AI.","PeriodicalId":222372,"journal":{"name":"Proceedings of the 2022 5th Artificial Intelligence and Cloud Computing Conference","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2022 5th Artificial Intelligence and Cloud Computing Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3582099.3582128","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

As an endangered language, the Tujia language is a non-renewable intangible cultural resource. Automatic speech recognition (ASR) uses artificial intelligence technology to assist the manual labeling of Tujia language data and is therefore an effective means of protecting the language. However, because the Tujia language has few native speakers, little labeled corpus, and much noise in its recordings, acoustic models suffer from overfitting and low noise immunity, which seriously harms recognition accuracy. To tackle these deficiencies, an audio-visual speech recognition (AVSR) approach based on Transformer-CTC is proposed, which reduces the acoustic model's dependence on noise conditions and on data quantity by introducing visual modality information, including lip movements. Specifically, the new approach enriches the representation of the speakers' feature space through the fusion of audio and visual information, thus addressing the problem of limited information available from a single modality. Experimental results show that the optimal CER of AVSR is 8.2% lower than that of traditional audio-only models and 11.8% lower than that of lip reading. The proposed AVSR tackles the issue of low accuracy in recognizing endangered languages and is therefore of great significance for studying the protection and preservation of endangered languages through AI.
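To make the fused architecture concrete, the sketch below shows one plausible way to combine audio features with lip-region visual features in a Transformer encoder trained under a CTC objective. It is a minimal illustration, not the authors' implementation: the feature dimensions, the frame-level concatenation fusion, and the use of PyTorch are all assumptions.

```python
# Minimal sketch of audio-visual fusion with a Transformer encoder and a CTC
# objective. Dimensions, fusion strategy, and framework are illustrative only.
import torch
import torch.nn as nn

class AVTransformerCTC(nn.Module):
    def __init__(self, audio_dim=80, visual_dim=512, d_model=256,
                 nhead=4, num_layers=6, vocab_size=100):
        super().__init__()
        # Project each modality into a shared model dimension.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Fuse by concatenating the projected streams frame by frame.
        self.fusion = nn.Linear(2 * d_model, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # One extra output class for the CTC blank symbol.
        self.classifier = nn.Linear(d_model, vocab_size + 1)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, T, audio_dim); visual_feats: (batch, T, visual_dim)
        # Assumes both streams are already aligned to the same frame rate.
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        fused = self.fusion(torch.cat([a, v], dim=-1))
        enc = self.encoder(fused)
        # CTC expects (T, batch, classes) log-probabilities.
        return self.classifier(enc).log_softmax(-1).transpose(0, 1)

# Toy usage with random tensors, only to show the training step shape-wise.
model = AVTransformerCTC()
audio = torch.randn(2, 120, 80)    # 2 utterances, 120 frames of 80-dim Fbank
visual = torch.randn(2, 120, 512)  # matching lip-region embeddings
targets = torch.randint(1, 101, (2, 20))
log_probs = model(audio, visual)
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 120),
           target_lengths=torch.full((2,), 20))
loss.backward()
```

In a full system the visual embeddings would come from a separate lip-reading front end (e.g., a CNN over mouth-region crops), and the two streams would need to be resampled to a common frame rate before fusion; both details are left out of this sketch.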