Sparse representation of phonetic features for voice conversion with and without parallel data

2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Pub Date: 2017-12-01. DOI: 10.1109/ASRU.2017.8269002

Berrak Sisman, Haizhou Li, K. Tan

Citations: 40

Abstract

This paper presents a voice conversion framework that uses phonetic information in an exemplar-based voice conversion approach. The idea is motivated by the observation that phone-dependent exemplars lead to a better estimate of the activation matrix and therefore, possibly, to better conversion. We propose to use the phone segmentation results from automatic speech recognition (ASR) to construct a sub-dictionary for each phone. The proposed framework can work with or without parallel training data. With parallel training data, we found that the phonetic sub-dictionary outperforms the state-of-the-art baseline in both objective and subjective evaluations. Without parallel training data, we use Phonetic PosteriorGrams (PPGs) as the speaker-independent exemplars in the phonetic sub-dictionary, where they serve as a bridge between speakers. We report that this technique achieves competitive performance without the need for parallel training data.
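The parallel-data case described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each phone has a pair of aligned, non-negative dictionaries (source and target exemplars stored column by column), that every source frame already carries a phone label from ASR segmentation, and that activations are estimated with the multiplicative update commonly used for KL-divergence non-negative matrix factorization with an L1 sparsity penalty. The abstract does not state the objective function, so that choice, along with the names `estimate_activations`, `convert`, `src_dicts`, `tgt_dicts`, and the `sparsity` parameter, is hypothetical.

```python
import numpy as np


def estimate_activations(X, A, sparsity=0.1, n_iter=200, eps=1e-12):
    """Estimate non-negative activations H such that X is approximately A @ H.

    X: (d, T) non-negative source frames.
    A: (d, K) fixed dictionary of source exemplars (not updated).
    Assumed objective: KL divergence plus an L1 penalty on H.
    """
    H = np.ones((A.shape[1], X.shape[1]))
    col_sums = A.sum(axis=0)[:, None]  # A^T 1, one value per exemplar
    for _ in range(n_iter):
        ratio = X / (A @ H + eps)
        H *= (A.T @ ratio) / (col_sums + sparsity + eps)
    return H


def convert(X_src, phone_labels, src_dicts, tgt_dicts, **kwargs):
    """Convert frames phone by phone using paired sub-dictionaries.

    phone_labels: length-T array with one phone label per frame.
    src_dicts / tgt_dicts: dict mapping phone -> (d, K_phone) exemplar matrix,
    where column k of the source and target matrices are aligned exemplars.
    """
    d_tgt = next(iter(tgt_dicts.values())).shape[0]
    Y = np.zeros((d_tgt, X_src.shape[1]))
    for phone in np.unique(phone_labels):
        idx = np.where(phone_labels == phone)[0]
        # Activations are estimated against this phone's exemplars only.
        H = estimate_activations(X_src[:, idx], src_dicts[phone], **kwargs)
        # The same activations are applied to the paired target exemplars.
        Y[:, idx] = tgt_dicts[phone] @ H
    return Y
```

Restricting each frame to its own phone's exemplars is what the abstract refers to as a phonetic sub-dictionary. In the non-parallel variant, the abstract indicates that PPG exemplars take the speaker-independent role in the dictionary; that variant is not shown here.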



