Wideband Harmonic Model: Alignment and Noise Modeling for High Quality Speech Synthesis

Speech Synthesis Workshop. Pub Date: 2016-09-13. DOI: 10.21437/SSW.2016-37

Slava Shechtman, A. Sorin


Citations: 1

Abstract



Sinusoidal modeling of speech has been applied successfully to a broad range of speech analysis, synthesis and modification tasks. However, developing a high-fidelity full-band sinusoidal model that preserves its quality under speech transformation remains an open research problem. Such a model can be extremely useful for high-quality speech synthesis. In this paper we present an enhanced harmonic model representation for voiced/mixed wideband speech that is capable of high-quality speech reconstruction and transformation in the parametric domain. Two key elements of the proposed model are proper phase alignment and a decomposition of each speech frame into "deterministic" and dense "stochastic" harmonic model representations that can be manipulated separately. The stochastic harmonic representation is coupled to the deterministic one by means of an intra-frame periodic energy envelope, which is estimated at analysis time and preserved during reconstruction of the original or transformed speech. In addition, we present a compact representation of the stochastic harmonic component, so that the proposed model has fewer parameters than the regular full-band harmonic model while achieving better Signal-to-Reconstruction-Error performance. Furthermore, the improved phase alignment of the proposed model provides better phase coherence in transformed speech, resulting in higher-quality speech transformations. We demonstrate the subjective and objective performance of the new model on speech reconstruction and pitch modification tasks. Performance of the proposed model within unit selection TTS is also presented.
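The structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the raised-cosine envelope shape, the `dense_factor` spacing of the stochastic component, and the use of uniformly random phases are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows the general idea of a deterministic harmonic sum plus a denser, random-phase harmonic sum whose energy is shaped by an envelope that repeats once per pitch period.

```python
import numpy as np

def harmonic_frame(f0, amps, phases, n_samples, fs):
    """Sum of harmonics of f0: sum_k a_k * cos(2*pi*k*f0*t + phi_k)."""
    t = np.arange(n_samples) / fs
    a = np.asarray(amps, dtype=float)[:, None]
    phi = np.asarray(phases, dtype=float)[:, None]
    k = np.arange(1, a.shape[0] + 1)[:, None]      # harmonic indices 1..K
    return np.sum(a * np.cos(2 * np.pi * k * f0 * t + phi), axis=0)

def periodic_envelope(f0, depth, n_samples, fs):
    """Hypothetical intra-frame envelope that repeats every pitch period."""
    t = np.arange(n_samples) / fs
    return 1.0 + depth * np.cos(2 * np.pi * f0 * t)

def synthesize_frame(f0, det_amps, det_phases, sto_amps, depth,
                     n_samples, fs, rng, dense_factor=4):
    """Deterministic harmonics plus an envelope-modulated stochastic part."""
    det = harmonic_frame(f0, det_amps, det_phases, n_samples, fs)
    # Assumption: "dense" means components spaced at f0 / dense_factor,
    # with phases drawn at random for each synthesized frame.
    rand_phases = rng.uniform(-np.pi, np.pi, size=len(sto_amps))
    sto = harmonic_frame(f0 / dense_factor, sto_amps, rand_phases,
                         n_samples, fs)
    env = periodic_envelope(f0, depth, n_samples, fs)
    return det + env * sto

# Example: a 10 ms frame at 16 kHz with a 200 Hz pitch.
rng = np.random.default_rng(0)
frame = synthesize_frame(
    f0=200.0,
    det_amps=[1.0, 0.5, 0.25], det_phases=[0.0, 0.0, 0.0],
    sto_amps=[0.05] * 12, depth=0.5,
    n_samples=160, fs=16000, rng=rng,
)
```

Because the two parts are kept separate in this sketch, a pitch change could be simulated by calling `synthesize_frame` with a different `f0` while reusing the amplitude parameters; how the actual model resamples amplitudes and aligns phases under such a transformation is described in the paper, not here.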
