Gram Vaani ASR Challenge on spontaneous telephone speech recordings in regional variations of Hindi

Interspeech. Publication date: 2022-09-18. DOI: 10.21437/interspeech.2022-11371

Anish Bhanushali, Grant Bridgman, Deekshitha G, P. Ghosh, Pratik Kumar, Saurabh Kumar, Adithya Raj Kolladath, Nithya Ravi, Aaditeshwar Seth, Ashish Seth, Abhayjeet Singh, Vrunda N. Sukhadia, Umesh S, Sathvik Udupa, L. D. Prasad

Pages: 3548-3552

Citations: 4

Abstract

This paper describes the corpus and baseline systems for the Gram Vaani Automatic Speech Recognition (ASR) challenge in regional variations of Hindi. The corpus for this challenge comprises spontaneous telephone speech recordings collected by a social technology enterprise, Gram Vaani. The regional variations of Hindi, together with the spontaneity of the speech, the natural background, and transcriptions of variable accuracy due to crowdsourcing, make it a unique corpus for ASR on spontaneous telephonic speech. Around 1108 hours of real-world spontaneous speech recordings, including 1000 hours of unlabelled training data, 100 hours of labelled training data, 5 hours of development data and 3 hours of evaluation data, have been released as part of the challenge. The efficacy of both the training and test sets is validated on different ASR systems, in both a traditional time-delay neural network-hidden Markov model (TDNN-HMM) framework and a fully neural end-to-end (E2E) setup. The word error rate (WER) and character error rate (CER) on the eval set for a TDNN model trained on 100 hours of labelled data are 29.7% and 15.1%, respectively. In the E2E setup, the WER and CER on the eval set for a conformer model trained on 100 hours of data are 32.9% and 19.0%, respectively.
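
The abstract reports results as WER and CER without defining them. As a minimal sketch, both are commonly computed as the Levenshtein (edit) distance between the reference and hypothesis, divided by the reference length, over words for WER and over characters for CER. The function names below are hypothetical, and this is not the challenge's scoring script. The sketch assumes whitespace tokenisation for words and, for CER, counts Unicode code points with spaces removed; for Devanagari text a real scorer may normalise or segment differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over code points, ignoring spaces (an assumption)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    ref = "a b c d"
    hyp = "a x c"
    print(f"WER = {wer(ref, hyp):.2%}")
    print(f"CER = {cer(ref, hyp):.2%}")
```

Because the denominator is the reference length, either rate can exceed 100% when the hypothesis contains many insertions. Corpus-level figures such as those in the abstract are usually obtained by summing edit counts and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.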
