Latest Articles from IEEE/ACM Transactions on Audio, Speech, and Language Processing

Interference-Controlled Maximum Noise Reduction Beamformer Based on Deep-Learned Interference Manifold
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-23 DOI: 10.1109/TASLP.2024.3485551
Yichen Yang;Ningning Pan;Wen Zhang;Chao Pan;Jacob Benesty;Jingdong Chen
{"title":"Interference-Controlled Maximum Noise Reduction Beamformer Based on Deep-Learned Interference Manifold","authors":"Yichen Yang;Ningning Pan;Wen Zhang;Chao Pan;Jacob Benesty;Jingdong Chen","doi":"10.1109/TASLP.2024.3485551","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485551","url":null,"abstract":"Beamforming has been used in a wide range of applications to extract the signal of interest from microphone array observations, which consist of not only the signal of interest, but also noise, interference, and reverberation. The recently proposed interference-controlled maximum noise reduction (ICMR) beamformer provides a flexible way to control the specified amount of the interference attenuation and noise suppression; but it requires accurate estimation of the manifold vector of the interference sources, which is challenging to achieve in real-world applications. To address this issue, we introduce an interference-controlled maximum noise reduction network (ICMRNet) in this study, which is a deep neural network (DNN)-based method for manifold vector estimation. With densely connected modified conformer blocks and the end-to-end training strategy, the interference manifold is learned directly from the observation signals. This approach, akin to ICMR, adeptly adapts to time-varying interference and demonstrates superior convergence rate and extraction efficacy as compared to the linearly constrained minimum variance (LCMV)-based neural beamformers when appropriate attenuation factors are selected. Moreover, via learning-based extraction, ICMRNet effectively suppresses reverberation components within the target signal. Comparative analysis against baseline methods validates the efficacy of the proposed method.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4676-4690"},"PeriodicalIF":4.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
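For readers unfamiliar with the constrained beamforming setup referenced in this abstract, the sketch below computes classical LCMV weights from a noise-plus-interference covariance matrix and a constraint matrix of steering/manifold vectors, with a per-interferer response playing the role of an attenuation factor. It is a minimal illustration under assumed notation (R, C, f), not the authors' ICMRNet or their exact ICMR formulation; in the ICMR setting, the manifold vectors stacked into C are precisely what ICMRNet is trained to estimate from the observations.

```python
import numpy as np

def lcmv_weights(R, C, f):
    """LCMV beamformer weights w = R^{-1} C (C^H R^{-1} C)^{-1} f.

    R : (M, M) noise-plus-interference covariance matrix (Hermitian, positive definite)
    C : (M, K) constraint matrix whose columns are steering/manifold vectors
    f : (K,)   desired responses, e.g. 1 for the target and a small
               attenuation factor for each interference direction
    """
    Rinv_C = np.linalg.solve(R, C)            # R^{-1} C
    gram = C.conj().T @ Rinv_C                # C^H R^{-1} C
    return Rinv_C @ np.linalg.solve(gram, f)

# Toy usage: 4 microphones, a target held at unit gain and one interferer
# attenuated to 0.1 (all numbers here are hypothetical placeholders).
rng = np.random.default_rng(0)
M = 4
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R = A @ A.conj().T + np.eye(M)                # well-conditioned covariance
d_target = np.ones(M, dtype=complex)          # placeholder manifold vectors
d_interf = np.exp(1j * np.pi * 0.3 * np.arange(M))
C = np.stack([d_target, d_interf], axis=1)
w = lcmv_weights(R, C, f=np.array([1.0, 0.1], dtype=complex))
print(np.abs(C.conj().T @ w))                 # -> approximately [1.0, 0.1]
```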
Learning Dynamic and Static Representations for Extrapolation-Based Temporal Knowledge Graph Reasoning
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-23 DOI: 10.1109/TASLP.2024.3485500
Pengfei Li;Guangyou Zhou;Zhiwen Xie;Penghui Xie;Jimmy Xiangji Huang
{"title":"Learning Dynamic and Static Representations for Extrapolation-Based Temporal Knowledge Graph Reasoning","authors":"Pengfei Li;Guangyou Zhou;Zhiwen Xie;Penghui Xie;Jimmy Xiangji Huang","doi":"10.1109/TASLP.2024.3485500","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485500","url":null,"abstract":"Temporal knowledge graph reasoning aims to predict the missing links (facts) in the future timestamps. However, most existing methods have a common limitation: they focus on learning dynamic representations of temporal knowledge graphs and rarely consider static characteristics that remain unchanged over time. To address the above issues, we propose to learn the dynamic and static representations for temporal knowledge graph reasoning (DSTKG), which introduces two latent variables to capture the dynamic and static characteristics of entities in temporal knowledge graphs. First, we use a Bi-GRU-based inference network to learn the static latent representation of historical facts and a nonlinear discrete-time transition-based inference network to learn the dynamic latent representation. Then, we sample the latent variables multiple times using re-parameterization tricks to obtain high-quality embeddings and make predictions in the future timestamps. The empirical results on four benchmark datasets show that our model is more effective than state-of-the-art approaches. Compared with the strong baseline model DBKGE (RotatE), the proposed model achieves performance improvements of 2.69%, \u0000<inline-formula><tex-math>$1.59%$</tex-math></inline-formula>\u0000, 1.18% and 1.22% on Yago11k, Wikidata12k, ICEWS14 and ICEWS05-15 respectively, regarding the evaluation metric MRR.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4741-4754"},"PeriodicalIF":4.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142598650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
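Since the reported gains are with respect to MRR, the minimal sketch below shows how mean reciprocal rank is typically computed over link-prediction queries; the entity IDs and list-based interface are hypothetical, and this is not the authors' evaluation code.

```python
def mean_reciprocal_rank(ranked_lists, gold_entities):
    """Mean reciprocal rank over a set of link-prediction queries.

    ranked_lists : list of candidate lists, each sorted from most to least likely
    gold_entities: the correct entity for each query
    A query whose gold entity is absent from its candidate list contributes 0.
    """
    total = 0.0
    for candidates, gold in zip(ranked_lists, gold_entities):
        if gold in candidates:
            total += 1.0 / (candidates.index(gold) + 1)   # ranks are 1-based
    return total / len(ranked_lists)

# Toy example with hypothetical entity IDs: gold ranked 1st and 3rd -> MRR = (1 + 1/3) / 2
print(mean_reciprocal_rank([["e7", "e2"], ["e5", "e9", "e7"]], ["e7", "e7"]))  # 0.666...
```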
Derivative-Free Optimization for Low-Rank Adaptation in Large Language Models
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-09 DOI: 10.1109/TASLP.2024.3477330
Feihu Jin;Yifan Liu;Ying Tan
{"title":"Derivative-Free Optimization for Low-Rank Adaptation in Large Language Models","authors":"Feihu Jin;Yifan Liu;Ying Tan","doi":"10.1109/TASLP.2024.3477330","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3477330","url":null,"abstract":"Parameter-efficient tuning methods such as LoRA could achieve comparable performance to model tuning by tuning a small portion of the parameters. However, substantial computational resources are still required, as this process involves calculating gradients and performing back-propagation throughout the model. Much effort has recently been devoted to utilizing the derivative-free optimization methods to eschew the computation of gradients and showcase an augmented level of robustness in few-shot settings. In this paper, we prepend the low-rank modules into each self-attention layer of the model and employ two derivative-free optimization methods to optimize these low-rank modules at each layer alternately. Extensive results on various tasks and language models demonstrate that our proposed method achieves substantial improvement and exhibits clear advantages in memory usage and convergence speed compared to existing gradient-based parameter-efficient tuning and derivative-free optimization methods in few-shot settings.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4607-4616"},"PeriodicalIF":4.1,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
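To make the idea of tuning low-rank modules without gradients concrete, the sketch below pairs a LoRA-style rank-r update with a naive random-search step, one of the simplest derivative-free optimizers. The paper uses its own pair of derivative-free methods applied layer-wise and alternately, which are not reproduced here; all shapes, hyperparameters, and the toy objective below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_delta(A, B):
    """LoRA-style low-rank update: Delta W = A @ B with rank r << min(d_out, d_in)."""
    return A @ B

def random_search_step(A, B, loss_fn, sigma=0.01):
    """One step of a very simple derivative-free optimizer (random search).

    Only a sketch of the general idea; the paper's specific derivative-free
    optimizers are not reproduced here.
    """
    cand_A = A + sigma * rng.standard_normal(A.shape)
    cand_B = B + sigma * rng.standard_normal(B.shape)
    if loss_fn(cand_A, cand_B) < loss_fn(A, B):
        return cand_A, cand_B          # keep the perturbation only if it helps
    return A, B

# Toy setting: fit a rank-2 update so that (W + A @ B) x matches a target vector.
d, r = 8, 2
W = rng.standard_normal((d, d))
x = rng.standard_normal(d)
target = rng.standard_normal(d)
A, B = np.zeros((d, r)), 0.1 * rng.standard_normal((r, d))
loss = lambda A, B: np.sum(((W + low_rank_delta(A, B)) @ x - target) ** 2)

for _ in range(2000):
    A, B = random_search_step(A, B, loss)
print(f"final loss: {loss(A, B):.4f}")
```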
Smoothed Frame-Level SINR and Its Estimation for Sensor Selection in Distributed Acoustic Sensor Networks
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-09 DOI: 10.1109/TASLP.2024.3477277
Shanzheng Guan;Mou Wang;Zhongxin Bai;Jianyu Wang;Jingdong Chen;Jacob Benesty
{"title":"Smoothed Frame-Level SINR and Its Estimation for Sensor Selection in Distributed Acoustic Sensor Networks","authors":"Shanzheng Guan;Mou Wang;Zhongxin Bai;Jianyu Wang;Jingdong Chen;Jacob Benesty","doi":"10.1109/TASLP.2024.3477277","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3477277","url":null,"abstract":"Distributed acoustic sensor network (DASN) refers to a sound acquisition system that consists of a collection of microphones randomly distributed across a wide acoustic area. Theory and methods for DASN are gaining increasing attention as the associated technologies can be used in a broad range of applications to solve challenging problems. However, unlike traditional microphone arrays or centralized systems, properly exploiting the redundancy among different channels in DASN is facing many challenges including but not limited to variations in pre-amplification gains, clocks, sensors' response, and signal-to-interference-plus-noise ratios (SINRs). Selecting appropriate sensors relevant to the task at hand is therefore crucial in DASN. In this work, we propose a speaker-dependent smoothed frame-level SINR estimation method for sensor selection in multi-speaker scenarios, specifically addressing source movement within DASN. Additionally, we devise an approach for similarity measurement to generate dynamic speaker embeddings resilient to variations in reference speech levels. Furthermore, we introduce a novel loss function that integrates classification and ordinal regression within a unified framework. Extensive simulations are performed and the results demonstrate the efficacy of the proposed method in accurately estimating smoothed frame-level SINR dynamically, yielding state-of-the-art performance.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4554-4568"},"PeriodicalIF":4.1,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142517851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
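As a rough illustration of the quantity being estimated, the sketch below computes a frame-level SINR in dB from separated target and interference-plus-noise signals and smooths it with a first-order recursion. The paper's exact smoothing rule, speaker-dependent weighting, and DNN-based estimator are not reproduced here; the frame length, hop size, and smoothing constant are assumptions.

```python
import numpy as np

def smoothed_frame_sinr(target, interf_plus_noise, frame_len=512, hop=256,
                        alpha=0.9, eps=1e-12):
    """Frame-level SINR in dB followed by first-order recursive smoothing.

    A generic definition of a smoothed per-frame SINR; `alpha` is the
    smoothing constant (an assumed value, not taken from the paper).
    """
    n_frames = 1 + (len(target) - frame_len) // hop
    sinr = np.empty(n_frames)
    for i in range(n_frames):
        s = target[i * hop: i * hop + frame_len]
        v = interf_plus_noise[i * hop: i * hop + frame_len]
        sinr[i] = 10 * np.log10((np.sum(s ** 2) + eps) / (np.sum(v ** 2) + eps))
    smoothed = np.empty_like(sinr)
    smoothed[0] = sinr[0]
    for i in range(1, n_frames):
        smoothed[i] = alpha * smoothed[i - 1] + (1 - alpha) * sinr[i]
    return smoothed
```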
Towards Improved Objective Perceptual Audio Quality Assessment - Part 1: A Novel Data-Driven Cognitive Model
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-09 DOI: 10.1109/TASLP.2024.3477291
Pablo M. Delgado;Jürgen Herre
{"title":"Towards Improved Objective Perceptual Audio Quality Assessment - Part 1: A Novel Data-Driven Cognitive Model","authors":"Pablo M. Delgado;Jürgen Herre","doi":"10.1109/TASLP.2024.3477291","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3477291","url":null,"abstract":"Efficient audioquality assessment is vital for streamlining audio codec development. Objective assessment tools have been developed over time to algorithmically predict quality ratings from subjective assessments, the gold standard for quality judgment. Many of these tools use perceptual auditory models to extract audio features that are mapped to a basic audio quality score prediction using machine learning algorithms and subjective scores as training data. However, existing tools struggle with generalization in quality prediction, especially when faced with unknown signal and distortion types. This is particularly evident in the presence of signals coded using non-waveform-preserving parametric techniques. Addressing these challenges, this two-part work proposes extensions to the Perceptual Evaluation of Audio Quality (PEAQ - ITU-R BS.1387-1) recommendation. Part 1 focuses on increasing generalization, while Part 2 targets accurate spatial audio quality measurement in audio coding. To enhance prediction generalization, this paper (Part 1) introduces a novel machine learning approach that uses subjective data to model cognitive aspects of audio quality perception. The proposed method models the perceived severity of audible distortions by adaptively weighting different distortion metrics. The weights are determined using an interaction cost function that captures relationships between distortion salience and cognitive effects. Compared to other machine learning methods and established tools, the proposed architecture achieves higher prediction accuracy on large databases of previously unseen subjective quality scores. The perceptually-motivated model offers a more manageable alternative to general-purpose machine learning algorithms, allowing potential extensions and improvements to multi-dimensional quality measurement without complete retraining.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4661-4675"},"PeriodicalIF":4.1,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Audio-Only Phonetic Segment Classification Using Embeddings Learned From Audio and Ultrasound Tongue Imaging Data
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-07 DOI: 10.1109/TASLP.2024.3473316
Ilhan Aytutuldu;Yakup Genc;Yusuf Sinan Akgul
{"title":"Audio-Only Phonetic Segment Classification Using Embeddings Learned From Audio and Ultrasound Tongue Imaging Data","authors":"Ilhan Aytutuldu;Yakup Genc;Yusuf Sinan Akgul","doi":"10.1109/TASLP.2024.3473316","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473316","url":null,"abstract":"This paper presents a phonetic segment classification method based on joint embeddings learned from processing Ultrasound Tongue Imaging (UTI) and audio data. For constructing the embeddings, we compiled an ultrasound image dataset synchronized with audio that encompasses common speech scenarios. The embeddings are obtained from artificial neural network models trained on this dataset. During testing, our model processes only audio data, making it practical for speech therapy as no ultrasound imaging is required. Experiments show that our method yields similar performance compared to methods that simultaneously use both audio and UTI data. However, it outperforms the methods utilizing solely audio or UTI data in real-time classification.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4501-4510"},"PeriodicalIF":4.1,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142452685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Investigating the Design Space of Diffusion Models for Speech Enhancement
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-03 DOI: 10.1109/TASLP.2024.3473319
Philippe Gonzalez;Zheng-Hua Tan;Jan Østergaard;Jesper Jensen;Tommy Sonne Alstrøm;Tobias May
{"title":"Investigating the Design Space of Diffusion Models for Speech Enhancement","authors":"Philippe Gonzalez;Zheng-Hua Tan;Jan Østergaard;Jesper Jensen;Tommy Sonne Alstrøm;Tobias May","doi":"10.1109/TASLP.2024.3473319","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473319","url":null,"abstract":"Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach in adapting diffusion models to speech enhancement consists in modelling a progressive transformation between the clean and noisy speech signals. However, one popular diffusion model framework previously laid in image generation literature did not account for such a transformation towards the system input, which prevents from relating the existing diffusion-based speech enhancement systems with the aforementioned diffusion model framework. To address this, we extend this framework to account for the progressive transformation between the clean and noisy speech signals. This allows us to apply recent developments from image generation literature, and to systematically investigate design aspects of diffusion models that remain largely unexplored for speech enhancement, such as the neural network preconditioning, the training loss weighting, the stochastic differential equation (SDE), or the amount of stochasticity injected in the reverse process. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. Moreover, we show that a proper choice of preconditioning, training loss weighting, SDE and sampler allows to outperform a popular diffusion-based speech enhancement system while using fewer sampling steps, thus reducing the computational cost by a factor of four.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4486-4500"},"PeriodicalIF":4.1,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704960","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142524164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
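The design choices mentioned in this abstract (noise schedule, sampler, amount of injected stochasticity) can be illustrated with a generic Euler-Maruyama sampler for a variance-exploding reverse SDE, sketched below. This is a textbook-style example under assumed notation, not the authors' system or their preconditioning; in diffusion-based speech enhancement the score network would additionally be conditioned on the noisy speech, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_sde_sample(score_fn, x_T, sigmas, noise_scale=1.0):
    """Euler-Maruyama sampler for a variance-exploding reverse SDE (generic sketch).

    score_fn(x, sigma): approximation of the score of the perturbed data distribution.
    sigmas: decreasing noise levels, e.g. np.geomspace(sigma_max, sigma_min, n_steps).
    noise_scale: scales the injected stochasticity (1.0 = plain reverse SDE; the drift
        is not re-adjusted here, so other values are only a heuristic knob in this sketch).
    """
    x = x_T
    for i in range(len(sigmas) - 1):
        beta = sigmas[i] ** 2 - sigmas[i + 1] ** 2              # decrease in noise variance
        x = x + beta * score_fn(x, sigmas[i])                   # reverse-time drift term
        x = x + noise_scale * np.sqrt(beta) * rng.standard_normal(x.shape)  # diffusion term
    return x
```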
Real-Time Multichannel Deep Speech Enhancement in Hearing Aids: Comparing Monaural and Binaural Processing in Complex Acoustic Scenarios
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-02 DOI: 10.1109/TASLP.2024.3473315
Nils L. Westhausen;Hendrik Kayser;Theresa Jansen;Bernd T. Meyer
{"title":"Real-Time Multichannel Deep Speech Enhancement in Hearing Aids: Comparing Monaural and Binaural Processing in Complex Acoustic Scenarios","authors":"Nils L. Westhausen;Hendrik Kayser;Theresa Jansen;Bernd T. Meyer","doi":"10.1109/TASLP.2024.3473315","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473315","url":null,"abstract":"Deep learning has the potential to enhance speech signals and increase their intelligibility for users of hearing aids. Deep models suited for real-world application should feature a low computational complexity and low processing delay of only a few milliseconds. In this paper, we explore deep speech enhancement that matches these requirements and contrast monaural and binaural processing algorithms in two complex acoustic scenes. Both algorithms are evaluated with objective metrics and in experiments with hearing-impaired listeners performing a speech-in-noise test. Results are compared to two traditional enhancement strategies, i.e., adaptive differential microphone processing and binaural beamforming. While in diffuse noise, all algorithms perform similarly, the binaural deep learning approach performs best in the presence of spatial interferers. Through a post-analysis, this can be attributed to improvements at low SNRs and to precise spatial filtering.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4596-4606"},"PeriodicalIF":4.1,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704042","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
RISC: A Corpus for Shout Type Classification and Shout Intensity Prediction
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-02 DOI: 10.1109/TASLP.2024.3473302
Takahiro Fukumori;Taito Ishida;Yoichi Yamashita
{"title":"RISC: A Corpus for Shout Type Classification and Shout Intensity Prediction","authors":"Takahiro Fukumori;Taito Ishida;Yoichi Yamashita","doi":"10.1109/TASLP.2024.3473302","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473302","url":null,"abstract":"The detection of shouted speech is crucial in audio surveillance and monitoring. Although it is desirable for a security system to be able to identify emergencies, existing corpora provide only a binary label (i.e., shouted or normal) for each speech sample, making it difficult to predict the shout intensity. Furthermore, most corpora comprise only utterances typical of hazardous situations, meaning that classifiers cannot learn to discriminate such utterances from shouts typical of less hazardous situations such as cheers. Thus, this paper presents a novel research source, the RItsumeikan Shout Corpus (RISC), which contains wide variety types of shouted speech samples collected in recording experiments. Each shouted speech sample in RISC has a shout type and is also assigned shout intensity ratings via a crowdsourcing service. We also present a comprehensive performance comparison among deep learning approaches for speech type classification tasks and a shout intensity prediction task. The results show that feature learning based on the spectral and cepstral domains achieves high performance, no matter which network architecture is used. The results also demonstrate that shout type classification and intensity prediction are still challenging tasks, and RISC is expected to contribute to further development in this research area.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4434-4444"},"PeriodicalIF":4.1,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704045","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142434604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
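As an example of the spectral- and cepstral-domain inputs of the kind compared in the paper, the sketch below extracts a log-mel spectrogram and MFCCs with librosa. The file path, sampling rate, and parameter values are placeholders, and this is not the authors' feature pipeline.

```python
import numpy as np
import librosa

# "shout.wav" is a placeholder path; parameter values are typical defaults,
# not those used in the paper.
y, sr = librosa.load("shout.wav", sr=16000)

# Spectral-domain input: log-mel spectrogram.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)            # shape: (n_mels, n_frames)

# Cepstral-domain input: MFCCs.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=1024, hop_length=256)

print(log_mel.shape, mfcc.shape)
```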
Unsupervised Speech Enhancement Using Optimal Transport and Speech Presence Probability
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-02 DOI: 10.1109/TASLP.2024.3473318
Wenbin Jiang;Kai Yu;Fei Wen
{"title":"Unsupervised Speech Enhancement Using Optimal Transport and Speech Presence Probability","authors":"Wenbin Jiang;Kai Yu;Fei Wen","doi":"10.1109/TASLP.2024.3473318","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473318","url":null,"abstract":"Speech enhancement models based on deep learning are typically trained in a supervised manner, requiring a substantial amount of paired noisy-to-clean speech data for training. However, synthetically generated training data can only capture a limited range of realistic environments, and it is often challenging or even impractical to gather real-world pairs of noisy and ground-truth clean speech. To overcome this limitation, we propose an unsupervised learning approach for speech enhancement that eliminates the need for paired noisy-to-clean training data. Specifically, our method utilizes the optimal transport criterion to train the speech enhancement model in an unsupervised manner. It employs a fidelity loss based on noisy speech and a distribution divergence loss to minimize the difference between the distribution of the model's output and that of unpaired clean speech. Further, we use the speech presence probability as an additional optimization objective and incorporate the short-time Fourier transform (STFT) domain loss as an extra term for the unsupervised learning loss. We also apply the multi-resolution STFT loss as the validation loss to enhance the stability of the training process and improve the algorithm's performance. Experimental results on the VCTK + DEMAND benchmark demonstrate that the proposed method achieves competitive performance compared to the supervised methods. Furthermore, the speech recognition results on the CHiME4 benchmark show the superiority of the proposed method over its supervised counterpart.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4445-4455"},"PeriodicalIF":4.1,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142434621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
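The multi-resolution STFT loss used for validation in this abstract is commonly defined as a spectral-convergence term plus a log-magnitude term averaged over several FFT/hop settings. The PyTorch sketch below follows that common formulation; the resolutions and equal weighting are assumptions rather than the paper's exact variant.

```python
import torch
import torch.nn.functional as F

def stft_mag(x, n_fft, hop):
    """Magnitude STFT of a waveform tensor of shape (T,) or (batch, T)."""
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)
    return spec.abs()

def multi_resolution_stft_loss(estimate, reference,
                               resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Spectral-convergence + log-magnitude loss averaged over several resolutions.

    A common formulation used as a sketch here; the paper's exact variant and
    term weights are not reproduced.
    """
    loss = 0.0
    for n_fft, hop in resolutions:
        est = stft_mag(estimate, n_fft, hop)
        ref = stft_mag(reference, n_fft, hop)
        sc = torch.norm(ref - est, p="fro") / torch.norm(ref, p="fro").clamp(min=1e-8)
        mag = F.l1_loss(torch.log(est.clamp(min=1e-7)), torch.log(ref.clamp(min=1e-7)))
        loss = loss + sc + mag
    return loss / len(resolutions)
```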