比较用于非侵入式语音质量预测的神经网络架构

IF 2.4 3区 计算机科学 Q2 ACOUSTICS
Leif Førland Schill , Tobias Piechowiak , Clément Laroche , Pejman Mowlaee
{"title":"比较用于非侵入式语音质量预测的神经网络架构","authors":"Leif Førland Schill ,&nbsp;Tobias Piechowiak ,&nbsp;Clément Laroche ,&nbsp;Pejman Mowlaee","doi":"10.1016/j.specom.2024.103123","DOIUrl":null,"url":null,"abstract":"<div><p>Non-intrusive speech quality predictors evaluate speech quality without the use of a reference signal, making them useful in many practical applications. Recently, neural networks have shown the best performance for this task. Two such models in the literature are the convolutional neural network based DNSMOS and the bi-directional long short-term memory based Quality-Net, which were originally trained to predict subjective targets and intrusive PESQ scores, respectively. In this paper, these two architectures are trained on a single dataset, and used to predict the intrusive ViSQOL score. The evaluation is done on a number of test sets with a variety of mismatch conditions, including unseen speech and noise corpora, and common voice over IP distortions. The experiments show that the models achieve similar predictive ability on the training distribution, and overall good generalization to new noise and speech corpora. Unseen distortions are identified as an area where both models generalize poorly, especially DNSMOS. Our results also suggest that a pervasiveness of ambient noise in the training set can cause problems when generalizing to certain types of noise. Finally, we detail how the ViSQOL score can have undesirable dependencies on the reference pressure level and the voice activity level.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"165 ","pages":"Article 103123"},"PeriodicalIF":2.4000,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000943/pdfft?md5=5812564c5b5fd37eb77c86b9c56fb655&pid=1-s2.0-S0167639324000943-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Comparing neural network architectures for non-intrusive speech quality prediction\",\"authors\":\"Leif Førland Schill ,&nbsp;Tobias Piechowiak ,&nbsp;Clément Laroche ,&nbsp;Pejman Mowlaee\",\"doi\":\"10.1016/j.specom.2024.103123\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Non-intrusive speech quality predictors evaluate speech quality without the use of a reference signal, making them useful in many practical applications. Recently, neural networks have shown the best performance for this task. Two such models in the literature are the convolutional neural network based DNSMOS and the bi-directional long short-term memory based Quality-Net, which were originally trained to predict subjective targets and intrusive PESQ scores, respectively. In this paper, these two architectures are trained on a single dataset, and used to predict the intrusive ViSQOL score. The evaluation is done on a number of test sets with a variety of mismatch conditions, including unseen speech and noise corpora, and common voice over IP distortions. The experiments show that the models achieve similar predictive ability on the training distribution, and overall good generalization to new noise and speech corpora. Unseen distortions are identified as an area where both models generalize poorly, especially DNSMOS. Our results also suggest that a pervasiveness of ambient noise in the training set can cause problems when generalizing to certain types of noise. Finally, we detail how the ViSQOL score can have undesirable dependencies on the reference pressure level and the voice activity level.</p></div>\",\"PeriodicalId\":49485,\"journal\":{\"name\":\"Speech Communication\",\"volume\":\"165 \",\"pages\":\"Article 103123\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2024-08-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S0167639324000943/pdfft?md5=5812564c5b5fd37eb77c86b9c56fb655&pid=1-s2.0-S0167639324000943-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Speech Communication\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167639324000943\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639324000943","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
引用次数: 0

摘要

非侵入式语音质量预测器可在不使用参考信号的情况下评估语音质量,因此在许多实际应用中都非常有用。最近,神经网络在这项任务中表现出了最佳性能。文献中的两个此类模型是基于卷积神经网络的 DNSMOS 和基于双向长短期记忆的 Quality-Net,它们最初分别用于预测主观目标和侵入性 PESQ 分数。本文在单一数据集上对这两种架构进行了训练,并将其用于预测侵入性 ViSQOL 分数。评估是在具有各种不匹配条件的测试集上进行的,包括未见过的语音和噪音语料库,以及常见的 IP 语音失真。实验结果表明,这些模型对训练分布具有相似的预测能力,对新的噪音和语音语料具有良好的泛化能力。在这两个模型中,看不见的失真被认为是泛化效果较差的领域,尤其是 DNSMOS。我们的结果还表明,训练集中普遍存在的环境噪声会在泛化到某些类型的噪声时造成问题。最后,我们详细介绍了 ViSQOL 分数如何与参考压力水平和语音活动水平产生不良依赖关系。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Comparing neural network architectures for non-intrusive speech quality prediction

Non-intrusive speech quality predictors evaluate speech quality without the use of a reference signal, making them useful in many practical applications. Recently, neural networks have shown the best performance for this task. Two such models in the literature are the convolutional neural network based DNSMOS and the bi-directional long short-term memory based Quality-Net, which were originally trained to predict subjective targets and intrusive PESQ scores, respectively. In this paper, these two architectures are trained on a single dataset, and used to predict the intrusive ViSQOL score. The evaluation is done on a number of test sets with a variety of mismatch conditions, including unseen speech and noise corpora, and common voice over IP distortions. The experiments show that the models achieve similar predictive ability on the training distribution, and overall good generalization to new noise and speech corpora. Unseen distortions are identified as an area where both models generalize poorly, especially DNSMOS. Our results also suggest that a pervasiveness of ambient noise in the training set can cause problems when generalizing to certain types of noise. Finally, we detail how the ViSQOL score can have undesirable dependencies on the reference pressure level and the voice activity level.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Speech Communication
Speech Communication 工程技术-计算机:跨学科应用
CiteScore
6.80
自引率
6.20%
发文量
94
审稿时长
19.2 weeks
期刊介绍: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal''s primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信