Comparing neural network architectures for non-intrusive speech quality prediction
Authors: Leif Førland Schill, Tobias Piechowiak, Clément Laroche, Pejman Mowlaee
Journal: Speech Communication, volume 165, Article 103123
Published: 2024-08-30
DOI: 10.1016/j.specom.2024.103123
URL: https://www.sciencedirect.com/science/article/pii/S0167639324000943
Non-intrusive speech quality predictors evaluate speech quality without the use of a reference signal, making them useful in many practical applications. Recently, neural networks have shown the best performance for this task. Two such models in the literature are the convolutional neural network based DNSMOS and the bi-directional long short-term memory based Quality-Net, which were originally trained to predict subjective targets and intrusive PESQ scores, respectively. In this paper, these two architectures are trained on a single dataset, and used to predict the intrusive ViSQOL score. The evaluation is done on a number of test sets with a variety of mismatch conditions, including unseen speech and noise corpora, and common voice over IP distortions. The experiments show that the models achieve similar predictive ability on the training distribution, and overall good generalization to new noise and speech corpora. Unseen distortions are identified as an area where both models generalize poorly, especially DNSMOS. Our results also suggest that a pervasiveness of ambient noise in the training set can cause problems when generalizing to certain types of noise. Finally, we detail how the ViSQOL score can have undesirable dependencies on the reference pressure level and the voice activity level.
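The comparison described in the abstract comes down to checking how closely each model's predicted scores track the intrusive ViSQOL targets on matched and mismatched test sets. The following is a minimal sketch of such a check, not the authors' evaluation code: the abstract does not name its metrics, so Pearson correlation and root-mean-square error are assumed here as common choices, and all function names and score values are hypothetical illustrations on an assumed 1 to 5 scale.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def rmse(xs, ys):
    """Root-mean-square error between predictions and targets."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


def evaluate(predicted, target):
    """Summarize predictive ability on one test set (assumed metrics)."""
    return {"pearson": pearson(predicted, target), "rmse": rmse(predicted, target)}


# Hypothetical target scores and predictions, invented for illustration only.
target = [1.5, 2.0, 3.0, 4.0, 4.5]
matched_predictions = [1.6, 2.1, 2.9, 3.9, 4.4]      # test set resembling training data
mismatched_predictions = [3.0, 2.8, 3.1, 3.2, 2.9]   # test set with unseen distortions

matched = evaluate(matched_predictions, target)
mismatched = evaluate(mismatched_predictions, target)
```

In this sketch, poor generalization would show up as a lower correlation and a higher error on the mismatched set than on the matched one.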
About the journal:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.