{"title":"The VoiceMOS Challenge 2024: Beyond Speech Quality Prediction","authors":"Wen-Chin Huang, Szu-Wei Fu, Erica Cooper, Ryandhimas E. Zezario, Tomoki Toda, Hsin-Min Wang, Junichi Yamagishi, Yu Tsao","doi":"arxiv-2409.07001","DOIUrl":null,"url":null,"abstract":"We present the third edition of the VoiceMOS Challenge, a scientific\ninitiative designed to advance research into automatic prediction of human\nspeech ratings. There were three tracks. The first track was on predicting the\nquality of ``zoomed-in'' high-quality samples from speech synthesis systems.\nThe second track was to predict ratings of samples from singing voice synthesis\nand voice conversion with a large variety of systems, listeners, and languages.\nThe third track was semi-supervised quality prediction for noisy, clean, and\nenhanced speech, where a very small amount of labeled training data was\nprovided. Among the eight teams from both academia and industry, we found that\nmany were able to outperform the baseline systems. Successful techniques\nincluded retrieval-based methods and the use of non-self-supervised\nrepresentations like spectrograms and pitch histograms. These results showed\nthat the challenge has advanced the field of subjective speech rating\nprediction.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"28 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07001","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
We present the third edition of the VoiceMOS Challenge, a scientific initiative designed to advance research into the automatic prediction of human speech ratings. The challenge comprised three tracks. The first track focused on predicting the quality of "zoomed-in" high-quality samples from speech synthesis systems. The second track targeted rating prediction for samples from singing voice synthesis and voice conversion, covering a wide variety of systems, listeners, and languages. The third track addressed semi-supervised quality prediction for noisy, clean, and enhanced speech, where only a very small amount of labeled training data was provided. Among the eight participating teams from both academia and industry, many outperformed the baseline systems. Successful techniques included retrieval-based methods and the use of non-self-supervised representations such as spectrograms and pitch histograms. These results show that the challenge has further advanced the field of subjective speech rating prediction.
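
To make the "retrieval-based methods" and "non-self-supervised representations" mentioned above concrete, the following is a minimal, hypothetical sketch of how such a predictor could work: embed each utterance with a simple spectrogram-derived feature, then predict MOS for a new sample by averaging the ratings of its k nearest labeled neighbors. The feature extractor, function names, and random data are illustrative assumptions, not the participants' actual systems.

```python
import numpy as np

def log_spectrum_features(wav, n_fft=512, hop=160):
    """Toy spectrogram-based feature: mean log-magnitude spectrum per utterance.
    (A stand-in for the spectrograms / pitch histograms mentioned in the abstract.)"""
    # Frame the waveform and take an FFT per frame.
    frames = [wav[i:i + n_fft] for i in range(0, len(wav) - n_fft, hop)]
    mags = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    # Average over time to get one fixed-size vector per utterance.
    return np.log(mags + 1e-8).mean(axis=0)

def knn_mos_predict(query_feat, train_feats, train_mos, k=5):
    """Retrieval-based MOS prediction: return the mean rating of the k
    labeled training utterances nearest to the query in feature space."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(train_mos[nearest]))

# Hypothetical usage with random signals standing in for real utterances:
rng = np.random.default_rng(0)
train_feats = np.stack(
    [log_spectrum_features(rng.standard_normal(16000)) for _ in range(20)]
)
train_mos = rng.uniform(1.0, 5.0, size=20)  # labeled MOS ratings on a 1-5 scale
query = log_spectrum_features(rng.standard_normal(16000))
print(f"predicted MOS: {knn_mos_predict(query, train_feats, train_mos):.2f}")
```

A retrieval-based design like this is attractive in the low-label setting of the third track, since it needs no gradient training and degrades gracefully as the labeled pool shrinks; real systems would of course use stronger features and distance measures than this toy example.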