Naorem Karline Singh, Yambem Jina Chanu, Hoomexsun Pangsatabam
{"title":"MECOS: A bilingual Manipuri–English spontaneous code-switching speech corpus for automatic speech recognition","authors":"Naorem Karline Singh, Yambem Jina Chanu, Hoomexsun Pangsatabam","doi":"10.1016/j.csl.2024.101627","DOIUrl":null,"url":null,"abstract":"<div><p>In this study, we introduce a new code-switched speech database with 57h of Manipuri–English annotated spontaneous speech. Manipuri is an official language of India and is primarily spoken in the north–eastern Indian state of Manipur. Most Manipur native speakers today are bilingual and frequently use code switching in everyday discussions. By carefully assessing the amount of code-switched speech in each video, recordings from YouTube are gathered. 21,339 utterances and 291,731 instances of code switching are present in the database. Given the code-switching nature of the data, a proper annotation procedure is used, and the data are manually annotated using the Meitei Mayek unicode font and the roman alphabets for Manipuri and English, respectively. The transcription includes the information of the speakers, non-speech information, and the corresponding annotation. The aim of this research is to construct an automatic speech recognition (ASR) system as well as offer a thorough analysis and details of the speech corpus. We believe that our research is the first to use an ASR system for Manipuri–English code-switched speech. To evaluate the performance, ASR systems based on hybrid deep neural network and hidden Markov model (DNN–HMM), time delay neural network (TDNN), hybrid time delay neural network and long short-term memory (TDNN–LSTM) and three end-to-end (E2E) models i.e. hybrid connectionist temporal classification and attention model (CTC-Attention), Conformer, wav2vec XLSR are developed for Manipuri–English language. In comparison to other models, pure TDNN produces outcomes that are clearly superior.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"87 ","pages":"Article 101627"},"PeriodicalIF":3.1000,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S088523082400010X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
In this study, we introduce a new code-switched speech database with 57h of Manipuri–English annotated spontaneous speech. Manipuri is an official language of India and is primarily spoken in the north–eastern Indian state of Manipur. Most Manipur native speakers today are bilingual and frequently use code switching in everyday discussions. By carefully assessing the amount of code-switched speech in each video, recordings from YouTube are gathered. 21,339 utterances and 291,731 instances of code switching are present in the database. Given the code-switching nature of the data, a proper annotation procedure is used, and the data are manually annotated using the Meitei Mayek unicode font and the roman alphabets for Manipuri and English, respectively. The transcription includes the information of the speakers, non-speech information, and the corresponding annotation. The aim of this research is to construct an automatic speech recognition (ASR) system as well as offer a thorough analysis and details of the speech corpus. We believe that our research is the first to use an ASR system for Manipuri–English code-switched speech. To evaluate the performance, ASR systems based on hybrid deep neural network and hidden Markov model (DNN–HMM), time delay neural network (TDNN), hybrid time delay neural network and long short-term memory (TDNN–LSTM) and three end-to-end (E2E) models i.e. hybrid connectionist temporal classification and attention model (CTC-Attention), Conformer, wav2vec XLSR are developed for Manipuri–English language. In comparison to other models, pure TDNN produces outcomes that are clearly superior.
期刊介绍:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.