Re-examining the quality dimensions of synthetic speech

12th ISCA Speech Synthesis Workshop (SSW2023). Pub Date: 2023-08-26. DOI: 10.21437/ssw.2023-6

Fritz Seebauer, Michael Kuhlmann, Reinhold Haeb-Umbach, P. Wagner


Abstract

The aim of this paper is to develop a more comprehensive framework for evaluating synthetic speech. To this end, a series of tests culminating in an exploratory factor analysis (EFA) was carried out. The proposed dimensions that capture the construct of “synthetic speech quality” are “human-likeness”, “audio quality”, “negative emotion”, “dominance”, “positive emotion”, “calmness”, “seniority” and “gender”, with item-to-total correlations suggesting that “gender” is an orthogonal construct. A subsequent analysis of common acoustic features drawn from the forensic and phonetic literature reveals very weak correlations with the proposed scales. Inter-rater and inter-item agreement measures additionally reveal low consistency within the scales. We also argue that investigating the quality of synthetic speech systems requires a more fine-grained approach, and propose a method that attempts to capture individual quality dimensions in the time domain.
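As a rough illustration of how an item-to-total correlation can flag an item that does not belong to a construct, the sketch below computes a corrected item-total correlation (each item against the sum of the remaining items). It is not the authors' procedure: the function name `corrected_item_total`, the simulated ratings, and the noise levels are all assumptions made for the example.

```python
import numpy as np


def corrected_item_total(ratings: np.ndarray) -> np.ndarray:
    """Pearson r between each item and the sum of all *other* items.

    ratings: array of shape (n_stimuli, n_items), one column per rating item.
    (Hypothetical helper; not taken from the paper.)
    """
    ratings = np.asarray(ratings, dtype=float)
    total = ratings.sum(axis=1)
    result = np.empty(ratings.shape[1])
    for j in range(ratings.shape[1]):
        rest = total - ratings[:, j]  # exclude the item itself from the total
        result[j] = np.corrcoef(ratings[:, j], rest)[0, 1]
    return result


# Simulated data (assumption): three items driven by one shared latent
# factor, plus a fourth item that is unrelated to that factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
related = latent[:, None] + 0.5 * rng.normal(size=(200, 3))
unrelated = rng.normal(size=(200, 1))
ratings = np.hstack([related, unrelated])

r = corrected_item_total(ratings)
print(np.round(r, 2))
```

In this toy setup the first three items correlate strongly with the rest of the scale, while the fourth stays near zero, which is the kind of pattern that would suggest an item measures a separate construct.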

