Methodology for designing and creating Hindi speech corpus

D. Magdum, Manisha Shukla Dubey, T. Patil, Ronak Shah, S. Belhe, Mahesh Kulkarni
DOI: 10.1109/SPACES.2015.7058279 (https://doi.org/10.1109/SPACES.2015.7058279)
Published in: 2015 International Conference on Signal Processing and Communication Engineering Systems
Publication date: 2015-03-12
Citations: 9

Abstract

This paper describes the methodology we used for data collection and recording for our Hindi text-to-speech (TTS) system. The design of the speech corpus plays a very important role in the overall quality of a TTS system. A large text corpus of one million words was created for the existing TTS system. We crawled text from many domains, such as finance, government, and current news, and also drew on pre-built dictionaries. For the first time, we also generated and incorporated text from Hindi Short Message Service (SMS) messages. The aim was to build a generic speech corpus for Hindi. The crawled text was first filtered for correctness, e.g. spelling mistakes, validity as Hindi, and word length. The filtered words were then carefully analyzed to ensure that the prepared text is phonetically balanced. This curated text was then recorded by a professional recordist in a studio environment. The recorded speech data was processed and annotated to generate the final speech corpus. The paper explains the speech corpus creation process through its text crawling, filtering, recording, and annotation phases. The final speech corpus is used in the Hindi TTS system, which achieves a mean opinion score (MOS) of 2.8.
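The filtering and balancing steps mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the length limits, the function names (`filter_words`, `greedy_select`), and the use of Devanagari characters as a crude stand-in for phonetic units are all assumptions made for illustration, and the spelling check is omitted because the abstract does not say how it was done.

```python
import unicodedata

# Assumed limits; the abstract does not state the actual thresholds.
MIN_LEN, MAX_LEN = 1, 20


def in_devanagari(ch: str) -> bool:
    """True if the character lies in the Devanagari Unicode block (U+0900-U+097F)."""
    return "\u0900" <= ch <= "\u097f"


def filter_words(words):
    """Keep unique, NFC-normalized words that are purely Devanagari and within the length limits."""
    seen, kept = set(), []
    for raw in words:
        word = unicodedata.normalize("NFC", raw.strip())
        if word in seen:
            continue
        if word and all(in_devanagari(ch) for ch in word) and MIN_LEN <= len(word) <= MAX_LEN:
            seen.add(word)
            kept.append(word)
    return kept


def units(sentence):
    """Set of Devanagari characters in a sentence (a rough proxy for phonetic units)."""
    return {ch for ch in sentence if in_devanagari(ch)}


def greedy_select(sentences):
    """Greedily pick the sentence adding the most uncovered units until nothing new is gained."""
    remaining = list(sentences)
    covered, chosen = set(), []
    while remaining:
        best = max(remaining, key=lambda s: len(units(s) - covered))
        gain = units(best) - covered
        if not gain:
            break
        chosen.append(best)
        covered |= gain
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    print(filter_words(["नमस्ते", "hello", "नमस्ते", "भारत123", " पानी "]))
    print(greedy_select(["कमल", "कम", "जल"]))
```

A real pipeline would balance over phones or diphones obtained from a grapheme-to-phoneme step rather than raw characters; the greedy coverage loop would stay the same with a different `units` function.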