Re-examining the quality dimensions of synthetic speech
Fritz Seebauer, Michael Kuhlmann, Reinhold Haeb-Umbach, P. Wagner
12th ISCA Speech Synthesis Workshop (SSW2023), published 2023-08-26
DOI: 10.21437/ssw.2023-6 (https://doi.org/10.21437/ssw.2023-6)
Abstract
The aim of this paper is to develop a more comprehensive framework for evaluating synthetic speech. To this end, a series of tests culminating in an exploratory factor analysis (EFA) was carried out. The proposed dimensions that encapsulate the construct of “synthetic speech quality” are: “human-likeness”, “audio quality”, “negative emotion”, “dominance”, “positive emotion”, “calmness”, “seniority” and “gender”, with item-to-total correlations suggesting that “gender” is an orthogonal construct. A subsequent analysis of common acoustic features drawn from the forensic and phonetic literature reveals only very weak correlations with the proposed scales. Inter-rater and inter-item agreement measures additionally reveal low consistency within the scales. We also argue that investigating the quality of synthetic speech systems calls for a more fine-grained approach, and we propose a method that attempts to capture individual quality dimensions in the time domain.
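As an illustration of the item-to-total correlations and inter-item consistency mentioned in the abstract, the following is a minimal sketch and not the authors' analysis code. It assumes a small, entirely hypothetical rating matrix (rows are rated stimuli, columns are questionnaire items) and uses two standard statistics: the corrected item-total correlation (each item against the sum of the remaining items) and Cronbach's alpha. The function names and the toy data are assumptions made for this example only.

```python
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def corrected_item_total(ratings):
    """Correlate each item with the sum of all *other* items."""
    n_items = len(ratings[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in ratings]
        rest = [sum(row) - row[j] for row in ratings]
        result.append(pearson(item, rest))
    return result


def cronbach_alpha(ratings):
    """Cronbach's alpha, computed with sample variances."""
    k = len(ratings[0])
    item_vars = [variance([row[j] for row in ratings]) for j in range(k)]
    total_var = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical toy ratings: items 0 and 1 move together,
# item 2 varies independently of them.
ratings = [
    [1, 2, 3],
    [2, 3, 1],
    [3, 4, 3],
    [4, 5, 1],
    [5, 6, 3],
    [6, 7, 1],
]

item_total = corrected_item_total(ratings)
alpha = cronbach_alpha(ratings)
print("corrected item-total correlations:", item_total)
print("Cronbach's alpha:", alpha)
```

In this toy data the third item correlates weakly (here slightly negatively) with the sum of the other two, which is the kind of pattern that would flag an item as measuring a separate construct.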