Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation

12th ISCA Speech Synthesis Workshop (SSW2023). Pub Date: 2023-08-26. DOI: 10.21437/ssw.2023-7

Ambika Kirkland, Shivam Mehta, Harm Lameris, G. Henter, Éva Székely, Joakim Gustafson

Citations: 3

Abstract

The Mean Opinion Score (MOS) is a prevalent metric in TTS evaluation. Although standards for collecting and reporting MOS exist, researchers seem to use the term inconsistently, and underreport the details of their testing methodologies. A survey of Interspeech and SSW papers from 2021-2022 shows that most authors do not report scale labels, increments, or instructions to participants, and those who do diverge in terms of their implementation. It is also unclear in many cases whether listeners were asked to rate naturalness, or overall quality. MOS obtained for natural speech using different testing methodologies vary in the surveyed papers: specifically, quality MOS is on average higher than naturalness MOS. We carried out several listening tests using the same stimuli but with differences in the scale increment and instructions about what participants should rate, and found that both of these variables affected MOS for some systems.
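For readers unfamiliar with the metric, the following is a minimal sketch of how a MOS and an approximate 95% confidence interval are commonly computed from a set of listener ratings. It is not the authors' analysis code. The function name `mean_opinion_score`, the normal-approximation interval, and the ratings are all assumptions made for illustration; the ratings are invented and do not come from the paper.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of a normal-approximation confidence interval).

    `ratings` is a sequence of numeric listener scores, e.g. on a 1-5 scale.
    z=1.96 corresponds to an approximate 95% interval (an assumption here;
    a t-distribution is more appropriate for small listener panels).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate an interval")
    mos = statistics.fmean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, z * sem


# Hypothetical ratings for two made-up conditions (NOT data from the paper).
ratings_by_condition = {
    "naturalness, 1-point increments": [4, 3, 5, 4, 4, 3, 4, 5],
    "quality, 0.5-point increments": [4.5, 3.5, 5.0, 4.0, 4.5, 3.5, 4.0, 5.0],
}

for condition, ratings in ratings_by_condition.items():
    mos, half_width = mean_opinion_score(ratings)
    print(f"{condition}: MOS = {mos:.2f} +/- {half_width:.2f}")
```

Because the computation is the same whatever the listeners were asked, a reported MOS is hard to interpret without the scale labels, the increment, and the rating instructions, which is the reporting gap the abstract describes.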

