MECOS：用于自动语音识别的曼尼普尔语-英语双语自发代码转换语音库

IF 3.1 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language Pub Date : 2024-02-20 DOI:10.1016/j.csl.2024.101627

Naorem Karline Singh, Yambem Jina Chanu, Hoomexsun Pangsatabam

{"title":"MECOS：用于自动语音识别的曼尼普尔语-英语双语自发代码转换语音库","authors":"Naorem Karline Singh, Yambem Jina Chanu, Hoomexsun Pangsatabam","doi":"10.1016/j.csl.2024.101627","DOIUrl":null,"url":null,"abstract":"<div><p>In this study, we introduce a new code-switched speech database with 57h of Manipuri–English annotated spontaneous speech. Manipuri is an official language of India and is primarily spoken in the north–eastern Indian state of Manipur. Most Manipur native speakers today are bilingual and frequently use code switching in everyday discussions. By carefully assessing the amount of code-switched speech in each video, recordings from YouTube are gathered. 21,339 utterances and 291,731 instances of code switching are present in the database. Given the code-switching nature of the data, a proper annotation procedure is used, and the data are manually annotated using the Meitei Mayek unicode font and the roman alphabets for Manipuri and English, respectively. The transcription includes the information of the speakers, non-speech information, and the corresponding annotation. The aim of this research is to construct an automatic speech recognition (ASR) system as well as offer a thorough analysis and details of the speech corpus. We believe that our research is the first to use an ASR system for Manipuri–English code-switched speech. To evaluate the performance, ASR systems based on hybrid deep neural network and hidden Markov model (DNN–HMM), time delay neural network (TDNN), hybrid time delay neural network and long short-term memory (TDNN–LSTM) and three end-to-end (E2E) models i.e. hybrid connectionist temporal classification and attention model (CTC-Attention), Conformer, wav2vec XLSR are developed for Manipuri–English language. In comparison to other models, pure TDNN produces outcomes that are clearly superior.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"87 ","pages":"Article 101627"},"PeriodicalIF":3.1000,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MECOS: A bilingual Manipuri–English spontaneous code-switching speech corpus for automatic speech recognition\",\"authors\":\"Naorem Karline Singh, Yambem Jina Chanu, Hoomexsun Pangsatabam\",\"doi\":\"10.1016/j.csl.2024.101627\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>In this study, we introduce a new code-switched speech database with 57h of Manipuri–English annotated spontaneous speech. Manipuri is an official language of India and is primarily spoken in the north–eastern Indian state of Manipur. Most Manipur native speakers today are bilingual and frequently use code switching in everyday discussions. By carefully assessing the amount of code-switched speech in each video, recordings from YouTube are gathered. 21,339 utterances and 291,731 instances of code switching are present in the database. Given the code-switching nature of the data, a proper annotation procedure is used, and the data are manually annotated using the Meitei Mayek unicode font and the roman alphabets for Manipuri and English, respectively. The transcription includes the information of the speakers, non-speech information, and the corresponding annotation. The aim of this research is to construct an automatic speech recognition (ASR) system as well as offer a thorough analysis and details of the speech corpus. We believe that our research is the first to use an ASR system for Manipuri–English code-switched speech. To evaluate the performance, ASR systems based on hybrid deep neural network and hidden Markov model (DNN–HMM), time delay neural network (TDNN), hybrid time delay neural network and long short-term memory (TDNN–LSTM) and three end-to-end (E2E) models i.e. hybrid connectionist temporal classification and attention model (CTC-Attention), Conformer, wav2vec XLSR are developed for Manipuri–English language. In comparison to other models, pure TDNN produces outcomes that are clearly superior.</p></div>\",\"PeriodicalId\":50638,\"journal\":{\"name\":\"Computer Speech and Language\",\"volume\":\"87 \",\"pages\":\"Article 101627\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2024-02-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Speech and Language\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S088523082400010X\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S088523082400010X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

在本研究中，我们引入了一个新的代码切换语音数据库，其中包含 57h 曼尼普尔语-英语注释自发语音。曼尼普尔语是印度的官方语言，主要在印度东北部的曼尼普尔邦使用。如今，大多数曼尼普尔本地人都会说两种语言，并且在日常讨论中经常使用语码转换。通过仔细评估每段视频中的代码转换语音量，我们收集了 YouTube 上的录音。数据库中共有 21,339 个语句和 291,731 个语码转换实例。考虑到数据的代码切换性质，我们使用了适当的注释程序，并分别使用曼尼普尔语和英语的迈特马耶克统一编码字体和罗马字母对数据进行了人工注释。转录包括说话人信息、非语音信息和相应的注释。本研究的目的是构建一个自动语音识别（ASR）系统，并对语音语料库进行全面分析和详细说明。我们相信，我们的研究是首个将 ASR 系统用于曼尼普尔语-英语代码转换语音的研究。为了评估性能，我们为曼尼普尔语-英语开发了基于混合深度神经网络和隐马尔可夫模型（DNN-HMM）、时延神经网络（TDNN）、混合时延神经网络和长短期记忆（TDNN-LSTM）以及三种端到端（E2E）模型（即混合连接主义时间分类和注意力模型（CTC-Attention）、Conformer、wav2vec XLSR）的 ASR 系统。与其他模型相比，纯 TDNN 产生的结果明显更优。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

MECOS: A bilingual Manipuri–English spontaneous code-switching speech corpus for automatic speech recognition

In this study, we introduce a new code-switched speech database with 57h of Manipuri–English annotated spontaneous speech. Manipuri is an official language of India and is primarily spoken in the north–eastern Indian state of Manipur. Most Manipur native speakers today are bilingual and frequently use code switching in everyday discussions. By carefully assessing the amount of code-switched speech in each video, recordings from YouTube are gathered. 21,339 utterances and 291,731 instances of code switching are present in the database. Given the code-switching nature of the data, a proper annotation procedure is used, and the data are manually annotated using the Meitei Mayek unicode font and the roman alphabets for Manipuri and English, respectively. The transcription includes the information of the speakers, non-speech information, and the corresponding annotation. The aim of this research is to construct an automatic speech recognition (ASR) system as well as offer a thorough analysis and details of the speech corpus. We believe that our research is the first to use an ASR system for Manipuri–English code-switched speech. To evaluate the performance, ASR systems based on hybrid deep neural network and hidden Markov model (DNN–HMM), time delay neural network (TDNN), hybrid time delay neural network and long short-term memory (TDNN–LSTM) and three end-to-end (E2E) models i.e. hybrid connectionist temporal classification and attention model (CTC-Attention), Conformer, wav2vec XLSR are developed for Manipuri–English language. In comparison to other models, pure TDNN produces outcomes that are clearly superior.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Computer Speech and Language 工程技术-计算机：人工智能

CiteScore

11.30

自引率

4.70%

发文量

审稿时长

22.9 weeks

期刊介绍： Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.