Development and evaluation of unit selection and HMM-based speech synthesis systems for Tamil

2013 National Conference on Communications (NCC). Pub Date: 2013-03-28. DOI: 10.1109/NCC.2013.6487984

Ramani Boothalingam, V. Sherlin Solomi, A. R. Gladston, S. Christina, P. Vijayalakshmi, N. Thangavelu, H. Murthy


Citations: 23

Abstract

An unrestricted text-to-speech system is expected to produce, for any given text in a language, a speech signal that is highly intelligible to a human listener. At present, unit selection-based synthesis (USS) and statistical parametric synthesis are the state-of-the-art techniques for this task. Earlier work [3] developed a concatenative synthesizer for Tamil using 12 hrs of speech data and showed that the syllable is the better subword unit. The current work focuses on building FestVox voices with the phoneme or the CV unit as the subword unit, using a reduced amount of speech data (5 hrs), and on comparing their performance in terms of quality. A further aim is to compare the performance of this synthesizer with that of the well-known HMM-based speech synthesizer. Of the phoneme-based and CV-based systems built, the phoneme-based system outperforms the CV-based system, with an MOS of 2.96, even though it is bound to have more concatenation points; the primary reason is that more examples of each phoneme are available for the given amount of speech data. An HMM-based speech synthesis system is also developed using 5 hrs of data. Although the speaker identity is not completely preserved in the synthesized speech, there are no sonic glitches, and the quality obtained, with an MOS of 3.86, is much better than that of the phoneme/CV-based systems. In addition, the footprint of the system is drastically reduced, from 1 GB for the USS system to 6 MB for the HMM-based speech synthesis system.
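The two figures of merit quoted in the abstract, the mean opinion score (MOS) and the footprint reduction, can be illustrated with a minimal sketch. It assumes that the MOS is the plain arithmetic mean of listener ratings on the common 1 to 5 scale and that 1 GB is taken as 1024 MB; the abstract states neither the rating protocol nor the unit convention. The function names and the sample ratings are hypothetical and are not data from the paper.

```python
def mean_opinion_score(ratings):
    """Arithmetic mean of listener ratings, assumed to lie on a 1-5 scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in the range 1 to 5")
    return sum(ratings) / len(ratings)


def footprint_reduction(before_mb, after_mb):
    """How many times smaller the second footprint is than the first."""
    if before_mb <= 0 or after_mb <= 0:
        raise ValueError("footprints must be positive")
    return before_mb / after_mb


# Hypothetical listener ratings for one synthesized utterance (not from the paper).
example_ratings = [4, 3, 4, 5, 3]
example_mos = mean_opinion_score(example_ratings)

# Footprints from the abstract: 1 GB (assumed to be 1024 MB) for USS, 6 MB for HMM.
reduction = footprint_reduction(1024, 6)

print(f"example MOS: {example_mos:.2f}")
print(f"footprint reduction: about {reduction:.0f}x")
```

Under these assumptions, the reported footprints correspond to a reduction of roughly 170 times.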

