On the Generalization Ability of Complex-Valued Variational U-Networks for Single-Channel Speech Enhancement
Authors: Eike J. Nustede; Jörn Anemüller
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3838-3849
Published: 2024-08-15
DOI: 10.1109/TASLP.2024.3444492
URL: https://ieeexplore.ieee.org/document/10637717/
The ability to generalize to different environments is important for audio de-noising systems in real-world scenarios. Single-channel signals in particular require efficient noise filtering that does not degrade speech intelligibility. Our previous work has shown that a probabilistic latent space model combined with a U-Network architecture increases performance and generalization ability to some extent. Here, we further evaluate magnitude-only as well as complex-valued U-Network models on two different datasets and in a train-test mismatch scenario. Model adaptability is evaluated by introducing a curve-based score similar to area-under-the-curve metrics. The proposed probabilistic latent space models outperform their ablated variants in most conditions, as well as well-known comparison methods, while the increase in network size is negligible. Improvements of up to 0.97 dB SI-SDR in matched and 2.72 dB SI-SDR in mismatched conditions are observed, with highest total SI-SDR scores of 20.21 dB and 18.71 dB, respectively. The proposed stability score aligns well with the observed performance behaviour, further validating the probabilistic latent space model.
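The SI-SDR figures quoted in the abstract can be read against the commonly used definition of the scale-invariant signal-to-distortion ratio: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. The sketch below is a minimal pure-Python illustration of that definition, not the authors' evaluation code. The function name `si_sdr`, the mean removal, the `eps` guard, and the synthetic test signal are all assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length sample sequences."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")

    # Remove the mean so that a DC offset does not influence the score
    # (a common convention, assumed here).
    mean_e = sum(estimate) / len(estimate)
    mean_r = sum(reference) / len(reference)
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]

    # Optimal scaling of the reference: alpha = <e, r> / <r, r>.
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(b * b for b in r)
    alpha = dot / (ref_energy + eps)

    # Split the estimate into a scaled target and a residual.
    target = [alpha * b for b in r]
    residual = [a - t for a, t in zip(e, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Hypothetical toy signals: a clean sinusoid plus an orthogonal distortion
# at one tenth of its amplitude, i.e. an energy ratio of 100 (about 20 dB).
N = 100
reference = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
distortion = [0.1 * math.cos(2 * math.pi * 13 * n / N) for n in range(N)]
estimate = [s + d for s, d in zip(reference, distortion)]

score = si_sdr(estimate, reference)
# Rescaling the estimate leaves the score unchanged (scale invariance).
score_scaled = si_sdr([3.0 * x for x in estimate], reference)
```

An improvement such as the reported "0.97 dB SI-SDR" would, under this reading, be the difference between two such scores computed for two systems against the same reference.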
About the journal:
The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.