{"title":"MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module","authors":"Ondvrej Pl'atek, Ondrej Dusek","doi":"10.21437/ssw.2023-8","DOIUrl":null,"url":null,"abstract":"We present MooseNet, a trainable speech metric that predicts the listeners' Mean Opinion Score (MOS). We propose a novel approach where the Probabilistic Linear Discriminative Analysis (PLDA) generative model is used on top of an embedding obtained from a self-supervised learning (SSL) neural network (NN) model. We show that PLDA works well with a non-finetuned SSL model when trained only on 136 utterances (ca. one minute training time) and that PLDA consistently improves various neural MOS prediction models, even state-of-the-art models with task-specific fine-tuning. Our ablation study shows PLDA training superiority over SSL model fine-tuning in a low-resource scenario. We also improve SSL model fine-tuning using a convenient optimizer choice and additional contrastive and multi-task training objectives. The fine-tuned MooseNet NN with the PLDA module achieves the best results, surpassing the SSL baseline on the VoiceMOS Challenge data.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"12th ISCA Speech Synthesis Workshop (SSW2023)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/ssw.2023-8","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
We present MooseNet, a trainable speech metric that predicts the listeners' Mean Opinion Score (MOS). We propose a novel approach in which a Probabilistic Linear Discriminant Analysis (PLDA) generative model is used on top of an embedding obtained from a self-supervised learning (SSL) neural network (NN) model. We show that PLDA works well with a non-fine-tuned SSL model when trained on only 136 utterances (ca. one minute of training time) and that PLDA consistently improves various neural MOS prediction models, even state-of-the-art models with task-specific fine-tuning. Our ablation study shows that PLDA training is superior to SSL model fine-tuning in a low-resource scenario. We also improve SSL model fine-tuning through a convenient optimizer choice and additional contrastive and multi-task training objectives. The fine-tuned MooseNet NN with the PLDA module achieves the best results, surpassing the SSL baseline on the VoiceMOS Challenge data.
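To make the frozen-SSL-plus-generative-model idea concrete, the sketch below shows one possible pipeline: mean-pooled wav2vec 2.0 embeddings for each utterance, MOS labels discretized into bins, and a generative linear discriminant classifier fit on top of the frozen embeddings. This is a minimal illustration, not the authors' implementation: the backbone checkpoint, pooling, binning scheme, and the use of scikit-learn's LinearDiscriminantAnalysis as a simplified stand-in for a full PLDA model are all assumptions made here for brevity.

```python
# Minimal sketch (not the authors' code): MOS prediction from frozen SSL embeddings.
# Assumptions: wav2vec 2.0 base as the SSL backbone, mean-pooled hidden states as
# utterance embeddings, MOS discretized into bins, and scikit-learn LDA used as a
# simplified stand-in for a full PLDA generative model.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
ssl_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def embed(waveform: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Mean-pool wav2vec 2.0 hidden states into one utterance-level embedding."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = ssl_model(inputs.input_values).last_hidden_state  # (1, T, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Toy training data: a handful of utterances with listener MOS labels (1.0-5.0).
waveforms = [np.random.randn(16000).astype(np.float32) for _ in range(8)]
mos = np.array([1.2, 2.0, 2.4, 3.1, 3.5, 4.0, 4.4, 4.9])

X = np.stack([embed(w) for w in waveforms])
y = np.digitize(mos, bins=[1.5, 2.5, 3.5, 4.5])  # discretize MOS into 5 classes

# Fit the generative classifier on top of frozen SSL embeddings (no fine-tuning).
clf = LinearDiscriminantAnalysis()
clf.fit(X, y)

# Predict a continuous score as the posterior-weighted average of bin centers.
bin_centers = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
posteriors = clf.predict_proba(X)
predicted_mos = posteriors @ bin_centers[clf.classes_]
print(predicted_mos)
```

Because only the lightweight classifier is trained while the SSL backbone stays frozen, this kind of setup fits quickly on very small label sets, which mirrors the low-resource scenario the abstract describes (136 utterances, about a minute of training).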