{"title":"利用内分法对非自回归文本到语音进行可变时长细化","authors":"Jaeuk Lee;Yoonsoo Shin;Joon-Hyuk Chang","doi":"10.1109/LSP.2024.3495578","DOIUrl":null,"url":null,"abstract":"Most non-autoregressive text-to-speech (TTS) models acquire target phoneme duration (target duration) from internal or external aligners. They transform the speech-phoneme alignment produced by the aligner into the target duration. Since this transformation is not differentiable, the gradient of the loss function that maximizes the TTS model's likelihood of speech (e.g., mel spectrogram or waveform) cannot be propagated to the target duration. In other words, the target duration is produced regardless of the TTS model's likelihood of speech. Hence, we introduce a differentiable duration refinement that produces a learnable target duration for maximizing the likelihood of speech. The proposed method uses an internal division to locate the phoneme boundary, which is determined to improve the performance of the TTS model. Additionally, we propose a duration distribution loss to enhance the performance of the duration predictor. Our baseline model is JETS, a representative end-to-end TTS model, and we apply the proposed methods to the baseline model. Experimental results show that the proposed method outperforms the baseline model in terms of subjective naturalness and character error rate.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"31 ","pages":"3154-3158"},"PeriodicalIF":3.2000,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Differentiable Duration Refinement Using Internal Division for Non-Autoregressive Text-to-Speech\",\"authors\":\"Jaeuk Lee;Yoonsoo Shin;Joon-Hyuk Chang\",\"doi\":\"10.1109/LSP.2024.3495578\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Most non-autoregressive text-to-speech (TTS) models acquire target phoneme duration (target duration) from internal or external aligners. They transform the speech-phoneme alignment produced by the aligner into the target duration. Since this transformation is not differentiable, the gradient of the loss function that maximizes the TTS model's likelihood of speech (e.g., mel spectrogram or waveform) cannot be propagated to the target duration. In other words, the target duration is produced regardless of the TTS model's likelihood of speech. Hence, we introduce a differentiable duration refinement that produces a learnable target duration for maximizing the likelihood of speech. The proposed method uses an internal division to locate the phoneme boundary, which is determined to improve the performance of the TTS model. Additionally, we propose a duration distribution loss to enhance the performance of the duration predictor. Our baseline model is JETS, a representative end-to-end TTS model, and we apply the proposed methods to the baseline model. 
Experimental results show that the proposed method outperforms the baseline model in terms of subjective naturalness and character error rate.\",\"PeriodicalId\":13154,\"journal\":{\"name\":\"IEEE Signal Processing Letters\",\"volume\":\"31 \",\"pages\":\"3154-3158\"},\"PeriodicalIF\":3.2000,\"publicationDate\":\"2024-11-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Signal Processing Letters\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10750273/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Signal Processing Letters","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10750273/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Most non-autoregressive text-to-speech (TTS) models acquire the target phoneme duration (target duration) from an internal or external aligner, transforming the speech-phoneme alignment produced by the aligner into the target duration. Since this transformation is not differentiable, the gradient of the loss function that maximizes the TTS model's likelihood of the speech (e.g., mel spectrogram or waveform) cannot be propagated to the target duration. In other words, the target duration is produced independently of the TTS model's likelihood of the speech. Hence, we introduce a differentiable duration refinement that produces a learnable target duration for maximizing the likelihood of the speech. The proposed method uses internal division to locate each phoneme boundary, which is determined so as to improve the performance of the TTS model. Additionally, we propose a duration distribution loss to enhance the performance of the duration predictor. Our baseline model is JETS, a representative end-to-end TTS model, and we apply the proposed methods to it. Experimental results show that the proposed method outperforms the baseline model in terms of subjective naturalness and character error rate.
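The internal-division idea in the abstract can be pictured with a short sketch. The PyTorch snippet below is a minimal illustration, assuming that each hard boundary produced by the aligner is refined by internally dividing the segment between two neighbouring candidate frames at a predicted ratio; the ratio predictor, the one-frame search window, and all names (InternalDivisionBoundary, ratio_predictor, etc.) are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class InternalDivisionBoundary(nn.Module):
    """Minimal sketch of internal-division boundary refinement (illustrative only)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Predicts one internal-division ratio w in (0, 1) per phoneme boundary.
        self.ratio_predictor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, hard_durations: torch.Tensor, phoneme_hidden: torch.Tensor):
        # hard_durations: (batch, n_phonemes) integer frame counts from the aligner
        # phoneme_hidden: (batch, n_phonemes, hidden_dim) phoneme encoder states
        hard_boundaries = torch.cumsum(hard_durations.float(), dim=-1)  # (B, N)
        ratios = self.ratio_predictor(phoneme_hidden).squeeze(-1)       # (B, N)

        # Internal division: place the refined boundary between the two candidate
        # points surrounding the hard boundary (here +/- 1 frame, an assumed
        # search window) at the predicted ratio.
        lower = hard_boundaries - 1.0
        upper = hard_boundaries + 1.0
        refined_boundaries = (1.0 - ratios) * lower + ratios * upper

        # The refined (now differentiable) target durations are the differences
        # between consecutive refined boundaries.
        refined_durations = torch.cat(
            [refined_boundaries[..., :1],
             refined_boundaries[..., 1:] - refined_boundaries[..., :-1]],
            dim=-1,
        )
        return refined_durations, refined_boundaries


# Toy usage with random tensors (shapes only; not a training recipe).
dur = torch.tensor([[3, 5, 2, 4]])        # hard durations from an aligner
hid = torch.randn(1, 4, 256)              # phoneme encoder states
refiner = InternalDivisionBoundary(hidden_dim=256)
refined_dur, refined_bnd = refiner(dur, hid)
print(refined_dur.shape)                  # torch.Size([1, 4])
```

Because the refined durations are a smooth function of the predicted ratios, the gradient of a spectrogram- or waveform-level loss can now reach the target duration, which is the property the abstract describes; how exactly the refined durations feed the JETS duration predictor is not specified here and is left as an assumption.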
Journal introduction:
The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.