{"title":"Importance of Human Factors in Text-To-Speech Evaluations","authors":"L. Finkelstein, Joshua Camp, R. Clark","doi":"10.21437/ssw.2023-5","DOIUrl":null,"url":null,"abstract":"Both mean opinion score (MOS) evaluations and preference tests in text-to-speech are often associated with high rating variance. In this paper we investigate two important factors that affect that variance. One factor is that the variance is coming from how raters are picked for a specific test, and another is the dynamic behavior of individual raters across time. This paper increases the awareness of these issues when designing an evaluation experiment, since the standard confidence interval on the test level cannot incorporate the variance associated with these two factors. We show the impact of the two sources of variance and how they can be mitigated. We demonstrate that simple improvements in experiment design such as using a smaller number of rating tasks per rater can significantly improve the experiment confidence intervals / reproducibility with no extra cost.","PeriodicalId":346639,"journal":{"name":"12th ISCA Speech Synthesis Workshop (SSW2023)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"12th ISCA Speech Synthesis Workshop (SSW2023)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/ssw.2023-5","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Both mean opinion score (MOS) evaluations and preference tests in text-to-speech are often associated with high rating variance. In this paper we investigate two important factors that contribute to that variance. The first is the variance introduced by how raters are selected for a specific test; the second is the dynamic behavior of individual raters over time. These issues deserve attention when designing an evaluation experiment, because the standard test-level confidence interval cannot capture the variance associated with these two factors. We show the impact of the two sources of variance and how they can be mitigated. We demonstrate that simple improvements in experiment design, such as assigning fewer rating tasks per rater, can significantly improve the confidence intervals and reproducibility of an experiment at no extra cost.
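To make the rater-selection effect concrete, the following is a minimal simulation sketch (not from the paper; all names and parameter values such as RATER_SD, NOISE_SD, and TOTAL_RATINGS are hypothetical). Each simulated rater carries a persistent bias, so ratings from the same rater are correlated. The naive test-level confidence interval treats all ratings as independent and therefore understates how much the test mean varies when the experiment is rerun with a fresh sample of raters; spreading the same rating budget over more raters, with fewer tasks each, narrows that gap.

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_MOS = 4.0        # hypothetical true quality of the system under test
RATER_SD = 0.3        # assumed between-rater bias (rater selection effect)
NOISE_SD = 0.7        # assumed within-rater, per-item rating noise
TOTAL_RATINGS = 600   # fixed rating budget per experiment

def run_experiment(n_raters, tasks_per_rater, rng):
    """One simulated MOS test: sample raters, each rates tasks_per_rater items."""
    rater_bias = rng.normal(0.0, RATER_SD, size=n_raters)
    ratings = (TRUE_MOS
               + np.repeat(rater_bias, tasks_per_rater)   # same bias for all of a rater's items
               + rng.normal(0.0, NOISE_SD, size=n_raters * tasks_per_rater))
    mean = ratings.mean()
    # Naive 95% CI half-width, treating all ratings as i.i.d.
    naive_ci = 1.96 * ratings.std(ddof=1) / np.sqrt(ratings.size)
    return mean, naive_ci

# Same total budget, two ways of allocating it across raters.
for n_raters in (10, 60):
    tasks = TOTAL_RATINGS // n_raters
    means, cis = zip(*(run_experiment(n_raters, tasks, rng) for _ in range(2000)))
    print(f"{n_raters:3d} raters x {tasks:3d} tasks: "
          f"replication SD of test mean = {np.std(means):.3f}, "
          f"mean naive 95% CI half-width = {np.mean(cis):.3f}")
```

Under these assumed parameters, the across-replication standard deviation of the test mean with 10 raters is roughly three times what the naive confidence interval implies, while with 60 raters (and the same total number of ratings) the two are much closer, illustrating the paper's point that fewer rating tasks per rater improves reproducibility at no extra cost.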