Experimental investigation on the efficacy of Affine-DTW in the quality of voice conversion

2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) Pub Date : 2019-11-01 DOI:10.1109/APSIPAASC47483.2019.9023107

Gaku Kotani, Hitoshi Suda, D. Saito, N. Minematsu

{"title":"Experimental investigation on the efficacy of Affine-DTW in the quality of voice conversion","authors":"Gaku Kotani, Hitoshi Suda, D. Saito, N. Minematsu","doi":"10.1109/APSIPAASC47483.2019.9023107","DOIUrl":null,"url":null,"abstract":"In this paper, the performance of Affine-DTW, which performs appropriate time alignment between source and target features in voice conversion (VC), is experimentally and thoroughly investigated. In traditional VC, parallel data are often required to train a mapping model between source and target features. While VC with non-parallel data is also studied to avoid collecting parallel data, the quality of its converted speech is still inferior to the traditional one with parallel data. One approach to further progress in VC is exploiting both parallel and non-parallel data, the former of which is pre-stored and the latter of which is assumed to be easily collected. In this case, it is still worthwhile to study time-alignment techniques to obtain appropriate alignment of parallel data. Affine-DTW is a technique in which dynamic time warping (DTW) and coarse conversion based on affine transformation are iteratively performed. In Affine-DTW, time alignment and parameters of affine transformation can be analytically calculated so that it can be easily adopted as pre-processing in VC. However, the influence on the performance of trained models based on the obtained alignments has not been well investigated experimentally. Hence, this paper investigates the performance of Affine-DTW in terms of quality improvement of converted speech in traditional VC methods based on Gaussian mixture models, non-negative matrix factorization and neural networks. Experimental results show that Affine-DTW obtains appropriate alignments and the naturalness improvement of converted speech in subjective assessments is observed in trained models based on the alignments.","PeriodicalId":145222,"journal":{"name":"2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/APSIPAASC47483.2019.9023107","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

Abstract

In this paper, the performance of Affine-DTW, which performs appropriate time alignment between source and target features in voice conversion (VC), is experimentally and thoroughly investigated. In traditional VC, parallel data are often required to train a mapping model between source and target features. While VC with non-parallel data is also studied to avoid collecting parallel data, the quality of its converted speech is still inferior to the traditional one with parallel data. One approach to further progress in VC is exploiting both parallel and non-parallel data, the former of which is pre-stored and the latter of which is assumed to be easily collected. In this case, it is still worthwhile to study time-alignment techniques to obtain appropriate alignment of parallel data. Affine-DTW is a technique in which dynamic time warping (DTW) and coarse conversion based on affine transformation are iteratively performed. In Affine-DTW, time alignment and parameters of affine transformation can be analytically calculated so that it can be easily adopted as pre-processing in VC. However, the influence on the performance of trained models based on the obtained alignments has not been well investigated experimentally. Hence, this paper investigates the performance of Affine-DTW in terms of quality improvement of converted speech in traditional VC methods based on Gaussian mixture models, non-negative matrix factorization and neural networks. Experimental results show that Affine-DTW obtains appropriate alignments and the naturalness improvement of converted speech in subjective assessments is observed in trained models based on the alignments.

查看原文本刊更多论文

仿射- dtw对语音转换质量影响的实验研究

本文对仿射- dtw的性能进行了实验研究，该算法在语音转换(VC)中实现了源和目标特征之间的适当时间对准。在传统的VC中，通常需要并行数据来训练源特征和目标特征之间的映射模型。虽然也研究了非并行数据的VC，以避免收集并行数据，但其转换的语音质量仍然不如传统的并行数据的VC。进一步发展VC的一种方法是同时利用并行和非并行数据，前者是预先存储的，后者被认为很容易收集。在这种情况下，仍然值得研究时间对齐技术，以获得适当的并行数据对齐。仿射-DTW是一种基于仿射变换迭代进行动态时间规整和粗转换的技术。在仿射- dtw中，可以解析计算仿射变换的时间对准和参数，便于在VC中进行预处理。然而，基于所获得的对齐对训练模型性能的影响还没有得到很好的实验研究。因此，本文研究了基于高斯混合模型、非负矩阵分解和神经网络的仿射- dtw在提高传统VC方法转换语音质量方面的性能。实验结果表明，仿射- dtw得到了适当的对齐，并且在基于对齐的训练模型中观察到转换后的语音在主观评估中的自然度提高。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

自引率

0.00%

发文量