Methodology for designing and creating Hindi speech corpus

2015 International Conference on Signal Processing and Communication Engineering Systems. Pub Date: 2015-03-12. DOI: 10.1109/SPACES.2015.7058279

D. Magdum, Manisha Shukla Dubey, T. Patil, Ronak Shah, S. Belhe, Mahesh Kulkarni


Citations: 9

Abstract


This paper describes the methodologies we used for data collection and recording for our Hindi text-to-speech (TTS) system. The design of the speech corpus plays a very important role in the overall quality of a TTS system. A large text corpus of one million words was created for the existing TTS system. We crawled text from many domains, such as finance, government, and current news, and also drew on pre-built dictionaries. For the first time, we also generated and incorporated text from Hindi Short Message Service (SMS) messages. The aim was to build a generic speech corpus for Hindi. The crawled text was first filtered for correctness, e.g. spelling mistakes, validity as Hindi, and word length. The filtered words were then carefully analyzed to ensure that the prepared text was phonetically balanced. This curated text was then recorded by a professional recordist in a studio environment. The recorded speech data was processed and annotated to generate the final speech corpus. The paper explains the speech corpus creation process through its text crawling, filtering, recording, and annotation phases. The resulting speech corpus is used in the Hindi TTS system, which achieves a mean opinion score (MOS) of 2.8.
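The abstract names three filtering criteria for the crawled text: spelling, validity as Hindi, and word length. It does not say how they were implemented, so the following is only a minimal sketch of what such a filter could look like. The function name `filter_words`, the length bounds, and the optional `lexicon` lookup (standing in for a spelling check against the pre-built dictionaries) are all assumptions for illustration, not the authors' method.

```python
import re

# Devanagari block is U+0900-U+097F. This pattern leaves out U+0964-U+096F
# (danda, double danda, and Devanagari digits) so punctuation and numbers
# are not accepted as words.
DEVANAGARI_WORD = re.compile(r"[\u0900-\u0963\u0970-\u097F]+")


def filter_words(words, lexicon=None, min_len=2, max_len=20):
    """Return unique words, in first-seen order, that pass three checks.

    1. Script validity: every character is in the Devanagari range above.
    2. Length: min_len <= number of code points <= max_len
       (code points, not grapheme clusters; bounds are assumed values).
    3. Spelling (optional): the word appears in `lexicon`, a hypothetical
       set of known-good word forms.
    """
    kept = []
    seen = set()
    for word in words:
        word = word.strip()
        if word in seen:
            continue  # already accepted once
        if not DEVANAGARI_WORD.fullmatch(word):
            continue  # mixed script, Latin text, digits, punctuation
        if not (min_len <= len(word) <= max_len):
            continue  # too short or too long
        if lexicon is not None and word not in lexicon:
            continue  # treated as a likely misspelling
        seen.add(word)
        kept.append(word)
    return kept


if __name__ == "__main__":
    crawled = ["भारत", "hello", "भारत", "क", "सरकार123", "समाचार"]
    print(filter_words(crawled))
    print(filter_words(crawled, lexicon={"भारत"}))
```

A lexicon lookup is the simplest possible stand-in for spell checking; a real pipeline would likely also normalize Unicode variants before comparing, which this sketch omits.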

