Development of Hindi mobile communication text and speech corpus

2011 International Conference on Speech Database and Assessments (Oriental COCOSDA), Pub Date: 2011-11-28, DOI: 10.1109/ICSDA.2011.6085975

S. Sinha, S. Agrawal, Jesper Ø. Olsen


Citations: 8

Abstract

This paper describes the collection of a text and audio corpus for mobile personal communication in Hindi. Hindi is the most widely spoken of the Indian languages and is the first language of more than 200 million people, who use it not only for spoken mobile communication but also for sending text messages to each other. The main script for Hindi is Devanagari, but it is not well supported by the current generation of mobile devices. The Devanagari alphabet is twice as large as the English alphabet, which makes it difficult to fit onto the small keypad of a mobile device. The aim of this project is to collect text and speech resources that can be used for training spoken language systems that aid text messaging on mobile devices, that is, to train a speech recogniser for the mobile personal communication domain so that text can be input through dictation rather than by typing. In total we collected a text corpus of 2 million words of natural messages in 12 different domains, and a spoken corpus of 100 speakers who each spoke 630 phonetically rich sentences, about 4 hours of speech. The speech utterances were recorded at 16 kHz through three recording channels: a mobile phone, a headset and a desktop-mounted microphone. The data sets have been properly annotated and are available for the development of speech recognition and synthesis systems in the mobile domain.
