Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation

Ambika Kirkland, Shivam Mehta, Harm Lameris, G. Henter, Éva Székely, Joakim Gustafson
{"title":"Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation","authors":"Ambika Kirkland, Shivam Mehta, Harm Lameris, G. Henter, Éva Székely, Joakim Gustafson","doi":"10.21437/ssw.2023-7","DOIUrl":null,"url":null,"abstract":"The Mean Opinion Score (MOS) is a prevalent metric in TTS evaluation. Although standards for collecting and reporting MOS exist, researchers seem to use the term inconsistently, and underreport the details of their testing methodologies. A survey of Interspeech and SSW papers from 2021-2022 shows that most authors do not report scale labels, increments, or instructions to participants, and those who do diverge in terms of their implementation. It is also unclear in many cases whether listeners were asked to rate naturalness, or overall quality. MOS obtained for natural speech using different testing methodologies vary in the surveyed papers: specifically, quality MOS is on average higher than naturalness MOS. We carried out several listening tests using the same stimuli but with differences in the scale increment and instructions about what participants should rate, and found that both of these variables affected MOS for some systems.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"12th ISCA Speech Synthesis Workshop (SSW2023)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/ssw.2023-7","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

The Mean Opinion Score (MOS) is a prevalent metric in TTS evaluation. Although standards for collecting and reporting MOS exist, researchers seem to use the term inconsistently, and underreport the details of their testing methodologies. A survey of Interspeech and SSW papers from 2021-2022 shows that most authors do not report scale labels, increments, or instructions to participants, and those who do diverge in terms of their implementation. It is also unclear in many cases whether listeners were asked to rate naturalness, or overall quality. MOS obtained for natural speech using different testing methodologies vary in the surveyed papers: specifically, quality MOS is on average higher than naturalness MOS. We carried out several listening tests using the same stimuli but with differences in the scale increment and instructions about what participants should rate, and found that both of these variables affected MOS for some systems.
陷入MOS陷阱:TTS评估中MOS测试方法的批判性分析
平均意见分数(MOS)是TTS评价中常用的指标。尽管存在收集和报告MOS的标准,但研究人员似乎不一致地使用了这个术语,并且低估了他们测试方法的细节。对Interspeech和SSW在2021-2022年间的论文进行的一项调查显示,大多数作者没有向参与者报告量表标签、增量或说明,而那些在实施方面存在分歧的作者。在许多情况下,也不清楚听众是被要求对自然程度进行评分,还是对整体质量进行评分。在被调查的论文中,使用不同测试方法获得的自然语音MOS各不相同:具体而言,质量MOS平均高于自然MOS。我们进行了几次听力测试,使用相同的刺激,但在尺度增量和参与者应该评分的指示上有所不同,并发现这两个变量都会影响某些系统的MOS。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信