Correlation Loss for MOS Prediction of Synthetic Speech
Authors: Beibei Hu, Qiang Li
Published in: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2022-11-07
DOI: https://doi.org/10.23919/APSIPAASC55919.2022.9980182
Many deep-learning-based methods have been developed for the speech mean opinion score (MOS) prediction task. These methods are generally evaluated with system-level and utterance-level mean squared error (MSE), Linear Correlation Coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC), and Kendall Tau Rank Correlation (KTAU). However, we find that the objective functions of many MOS prediction networks are based on MAE or MSE and contain no explicit correlation term. This paper investigates different correlation losses for voice MOS prediction networks. Using the datasets and the SSL-MOS baseline system provided by the VoiceMOS Challenge 2022, we train the MOS prediction network with different auxiliary correlation losses. The experimental results show that the proposed auxiliary correlation losses improve the performance of the SSL-MOS network on all six correlation metrics (LCC, SRCC, and KTAU at both the system level and the utterance level). Compared with the two best-performing systems in the VoiceMOS Challenge 2022, our approach achieves comparable performance on the system-level correlation metrics with a simpler system architecture.
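The abstract does not state the exact form or weighting of the auxiliary correlation losses, so the following is only a minimal sketch of the general idea, assuming an MSE objective augmented with a `1 - LCC` term. The function names, the `alpha` weight, and the `eps` stabilizer are hypothetical and are not taken from the paper; a real training setup would use a differentiable tensor library rather than plain Python lists.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length score lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pearson_lcc(pred, target, eps=1e-8):
    """Linear (Pearson) correlation coefficient of two score lists.

    eps is a hypothetical stabilizer that avoids division by zero
    when one of the inputs is constant.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / (math.sqrt(var_p * var_t) + eps)


def mse_with_lcc_loss(pred, target, alpha=1.0):
    """Assumed combined objective: MSE plus alpha * (1 - LCC).

    The correlation term is 0 when predictions are perfectly linearly
    correlated with the targets and 2 when they are perfectly
    anti-correlated. alpha is an illustrative weight, not a value
    reported by the authors.
    """
    return mse(pred, target) + alpha * (1.0 - pearson_lcc(pred, target))


if __name__ == "__main__":
    true_mos = [2.0, 3.0, 4.0, 5.0]
    shifted = [1.0, 2.0, 3.0, 4.0]   # right ranking, constant offset
    reversed_ = [5.0, 4.0, 3.0, 2.0]  # same values, wrong ranking
    print(mse_with_lcc_loss(shifted, true_mos))
    print(mse_with_lcc_loss(reversed_, true_mos))
```

In this sketch, predictions that are offset by a constant are penalized only through the MSE term, while predictions that invert the ordering of the true scores incur the full correlation penalty in addition to their MSE. That is the kind of behavior an explicit correlation term is meant to add on top of a purely MSE-based objective.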