{"title":"Interference-Controlled Maximum Noise Reduction Beamformer Based on Deep-Learned Interference Manifold","authors":"Yichen Yang;Ningning Pan;Wen Zhang;Chao Pan;Jacob Benesty;Jingdong Chen","doi":"10.1109/TASLP.2024.3485551","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485551","url":null,"abstract":"Beamforming has been used in a wide range of applications to extract the signal of interest from microphone array observations, which consist of not only the signal of interest, but also noise, interference, and reverberation. The recently proposed interference-controlled maximum noise reduction (ICMR) beamformer provides a flexible way to control the specified amount of the interference attenuation and noise suppression; but it requires accurate estimation of the manifold vector of the interference sources, which is challenging to achieve in real-world applications. To address this issue, we introduce an interference-controlled maximum noise reduction network (ICMRNet) in this study, which is a deep neural network (DNN)-based method for manifold vector estimation. With densely connected modified conformer blocks and the end-to-end training strategy, the interference manifold is learned directly from the observation signals. This approach, akin to ICMR, adeptly adapts to time-varying interference and demonstrates superior convergence rate and extraction efficacy as compared to the linearly constrained minimum variance (LCMV)-based neural beamformers when appropriate attenuation factors are selected. Moreover, via learning-based extraction, ICMRNet effectively suppresses reverberation components within the target signal. 
Comparative analysis against baseline methods validates the efficacy of the proposed method.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4676-4690"},"PeriodicalIF":4.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Dynamic and Static Representations for Extrapolation-Based Temporal Knowledge Graph Reasoning","authors":"Pengfei Li;Guangyou Zhou;Zhiwen Xie;Penghui Xie;Jimmy Xiangji Huang","doi":"10.1109/TASLP.2024.3485500","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485500","url":null,"abstract":"Temporal knowledge graph reasoning aims to predict the missing links (facts) in the future timestamps. However, most existing methods have a common limitation: they focus on learning dynamic representations of temporal knowledge graphs and rarely consider static characteristics that remain unchanged over time. To address the above issues, we propose to learn the dynamic and static representations for temporal knowledge graph reasoning (DSTKG), which introduces two latent variables to capture the dynamic and static characteristics of entities in temporal knowledge graphs. First, we use a Bi-GRU-based inference network to learn the static latent representation of historical facts and a nonlinear discrete-time transition-based inference network to learn the dynamic latent representation. Then, we sample the latent variables multiple times using re-parameterization tricks to obtain high-quality embeddings and make predictions in the future timestamps. The empirical results on four benchmark datasets show that our model is more effective than state-of-the-art approaches. 
Compared with the strong baseline model DBKGE (RotatE), the proposed model achieves performance improvements of 2.69%, 1.59%, 1.18% and 1.22% on Yago11k, Wikidata12k, ICEWS14 and ICEWS05-15 respectively, regarding the evaluation metric MRR.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4741-4754"},"PeriodicalIF":4.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142598650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Derivative-Free Optimization for Low-Rank Adaptation in Large Language Models","authors":"Feihu Jin;Yifan Liu;Ying Tan","doi":"10.1109/TASLP.2024.3477330","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3477330","url":null,"abstract":"Parameter-efficient tuning methods such as LoRA could achieve comparable performance to model tuning by tuning a small portion of the parameters. However, substantial computational resources are still required, as this process involves calculating gradients and performing back-propagation throughout the model. Much effort has recently been devoted to utilizing the derivative-free optimization methods to eschew the computation of gradients and showcase an augmented level of robustness in few-shot settings. In this paper, we prepend the low-rank modules into each self-attention layer of the model and employ two derivative-free optimization methods to optimize these low-rank modules at each layer alternately. Extensive results on various tasks and language models demonstrate that our proposed method achieves substantial improvement and exhibits clear advantages in memory usage and convergence speed compared to existing gradient-based parameter-efficient tuning and derivative-free optimization methods in few-shot settings.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4607-4616"},"PeriodicalIF":4.1,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Smoothed Frame-Level SINR and Its Estimation for Sensor Selection in Distributed Acoustic Sensor Networks","authors":"Shanzheng Guan;Mou Wang;Zhongxin Bai;Jianyu Wang;Jingdong Chen;Jacob Benesty","doi":"10.1109/TASLP.2024.3477277","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3477277","url":null,"abstract":"Distributed acoustic sensor network (DASN) refers to a sound acquisition system that consists of a collection of microphones randomly distributed across a wide acoustic area. Theory and methods for DASN are gaining increasing attention as the associated technologies can be used in a broad range of applications to solve challenging problems. However, unlike traditional microphone arrays or centralized systems, properly exploiting the redundancy among different channels in DASN is facing many challenges including but not limited to variations in pre-amplification gains, clocks, sensors' response, and signal-to-interference-plus-noise ratios (SINRs). Selecting appropriate sensors relevant to the task at hand is therefore crucial in DASN. In this work, we propose a speaker-dependent smoothed frame-level SINR estimation method for sensor selection in multi-speaker scenarios, specifically addressing source movement within DASN. Additionally, we devise an approach for similarity measurement to generate dynamic speaker embeddings resilient to variations in reference speech levels. Furthermore, we introduce a novel loss function that integrates classification and ordinal regression within a unified framework. 
Extensive simulations are performed and the results demonstrate the efficacy of the proposed method in accurately estimating smoothed frame-level SINR dynamically, yielding state-of-the-art performance.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4554-4568"},"PeriodicalIF":4.1,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142517851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Improved Objective Perceptual Audio Quality Assessment - Part 1: A Novel Data-Driven Cognitive Model","authors":"Pablo M. Delgado;Jürgen Herre","doi":"10.1109/TASLP.2024.3477291","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3477291","url":null,"abstract":"Efficient audio quality assessment is vital for streamlining audio codec development. Objective assessment tools have been developed over time to algorithmically predict quality ratings from subjective assessments, the gold standard for quality judgment. Many of these tools use perceptual auditory models to extract audio features that are mapped to a basic audio quality score prediction using machine learning algorithms and subjective scores as training data. However, existing tools struggle with generalization in quality prediction, especially when faced with unknown signal and distortion types. This is particularly evident in the presence of signals coded using non-waveform-preserving parametric techniques. Addressing these challenges, this two-part work proposes extensions to the Perceptual Evaluation of Audio Quality (PEAQ - ITU-R BS.1387-1) recommendation. Part 1 focuses on increasing generalization, while Part 2 targets accurate spatial audio quality measurement in audio coding. To enhance prediction generalization, this paper (Part 1) introduces a novel machine learning approach that uses subjective data to model cognitive aspects of audio quality perception. The proposed method models the perceived severity of audible distortions by adaptively weighting different distortion metrics. The weights are determined using an interaction cost function that captures relationships between distortion salience and cognitive effects. Compared to other machine learning methods and established tools, the proposed architecture achieves higher prediction accuracy on large databases of previously unseen subjective quality scores. 
The perceptually-motivated model offers a more manageable alternative to general-purpose machine learning algorithms, allowing potential extensions and improvements to multi-dimensional quality measurement without complete retraining.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4661-4675"},"PeriodicalIF":4.1,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Audio-Only Phonetic Segment Classification Using Embeddings Learned From Audio and Ultrasound Tongue Imaging Data","authors":"Ilhan Aytutuldu;Yakup Genc;Yusuf Sinan Akgul","doi":"10.1109/TASLP.2024.3473316","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473316","url":null,"abstract":"This paper presents a phonetic segment classification method based on joint embeddings learned from processing Ultrasound Tongue Imaging (UTI) and audio data. For constructing the embeddings, we compiled an ultrasound image dataset synchronized with audio that encompasses common speech scenarios. The embeddings are obtained from artificial neural network models trained on this dataset. During testing, our model processes only audio data, making it practical for speech therapy as no ultrasound imaging is required. Experiments show that our method yields similar performance compared to methods that simultaneously use both audio and UTI data. However, it outperforms the methods utilizing solely audio or UTI data in real-time classification.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4501-4510"},"PeriodicalIF":4.1,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142452685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating the Design Space of Diffusion Models for Speech Enhancement","authors":"Philippe Gonzalez;Zheng-Hua Tan;Jan Østergaard;Jesper Jensen;Tommy Sonne Alstrøm;Tobias May","doi":"10.1109/TASLP.2024.3473319","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473319","url":null,"abstract":"Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach in adapting diffusion models to speech enhancement consists in modelling a progressive transformation between the clean and noisy speech signals. However, one popular diffusion model framework previously laid in image generation literature did not account for such a transformation towards the system input, which prevents from relating the existing diffusion-based speech enhancement systems with the aforementioned diffusion model framework. To address this, we extend this framework to account for the progressive transformation between the clean and noisy speech signals. This allows us to apply recent developments from image generation literature, and to systematically investigate design aspects of diffusion models that remain largely unexplored for speech enhancement, such as the neural network preconditioning, the training loss weighting, the stochastic differential equation (SDE), or the amount of stochasticity injected in the reverse process. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. 
Moreover, we show that a proper choice of preconditioning, training loss weighting, SDE and sampler allows to outperform a popular diffusion-based speech enhancement system while using fewer sampling steps, thus reducing the computational cost by a factor of four.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4486-4500"},"PeriodicalIF":4.1,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704960","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142524164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-Time Multichannel Deep Speech Enhancement in Hearing Aids: Comparing Monaural and Binaural Processing in Complex Acoustic Scenarios","authors":"Nils L. Westhausen;Hendrik Kayser;Theresa Jansen;Bernd T. Meyer","doi":"10.1109/TASLP.2024.3473315","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473315","url":null,"abstract":"Deep learning has the potential to enhance speech signals and increase their intelligibility for users of hearing aids. Deep models suited for real-world application should feature a low computational complexity and low processing delay of only a few milliseconds. In this paper, we explore deep speech enhancement that matches these requirements and contrast monaural and binaural processing algorithms in two complex acoustic scenes. Both algorithms are evaluated with objective metrics and in experiments with hearing-impaired listeners performing a speech-in-noise test. Results are compared to two traditional enhancement strategies, i.e., adaptive differential microphone processing and binaural beamforming. While in diffuse noise, all algorithms perform similarly, the binaural deep learning approach performs best in the presence of spatial interferers. 
Through a post-analysis, this can be attributed to improvements at low SNRs and to precise spatial filtering.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4596-4606"},"PeriodicalIF":4.1,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704042","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RISC: A Corpus for Shout Type Classification and Shout Intensity Prediction","authors":"Takahiro Fukumori;Taito Ishida;Yoichi Yamashita","doi":"10.1109/TASLP.2024.3473302","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473302","url":null,"abstract":"The detection of shouted speech is crucial in audio surveillance and monitoring. Although it is desirable for a security system to be able to identify emergencies, existing corpora provide only a binary label (i.e., shouted or normal) for each speech sample, making it difficult to predict the shout intensity. Furthermore, most corpora comprise only utterances typical of hazardous situations, meaning that classifiers cannot learn to discriminate such utterances from shouts typical of less hazardous situations such as cheers. Thus, this paper presents a novel research source, the RItsumeikan Shout Corpus (RISC), which contains a wide variety of types of shouted speech samples collected in recording experiments. Each shouted speech sample in RISC has a shout type and is also assigned shout intensity ratings via a crowdsourcing service. We also present a comprehensive performance comparison among deep learning approaches for speech type classification tasks and a shout intensity prediction task. The results show that feature learning based on the spectral and cepstral domains achieves high performance, no matter which network architecture is used. 
The results also demonstrate that shout type classification and intensity prediction are still challenging tasks, and RISC is expected to contribute to further development in this research area.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4434-4444"},"PeriodicalIF":4.1,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704045","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142434604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised Speech Enhancement Using Optimal Transport and Speech Presence Probability","authors":"Wenbin Jiang;Kai Yu;Fei Wen","doi":"10.1109/TASLP.2024.3473318","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473318","url":null,"abstract":"Speech enhancement models based on deep learning are typically trained in a supervised manner, requiring a substantial amount of paired noisy-to-clean speech data for training. However, synthetically generated training data can only capture a limited range of realistic environments, and it is often challenging or even impractical to gather real-world pairs of noisy and ground-truth clean speech. To overcome this limitation, we propose an unsupervised learning approach for speech enhancement that eliminates the need for paired noisy-to-clean training data. Specifically, our method utilizes the optimal transport criterion to train the speech enhancement model in an unsupervised manner. It employs a fidelity loss based on noisy speech and a distribution divergence loss to minimize the difference between the distribution of the model's output and that of unpaired clean speech. Further, we use the speech presence probability as an additional optimization objective and incorporate the short-time Fourier transform (STFT) domain loss as an extra term for the unsupervised learning loss. We also apply the multi-resolution STFT loss as the validation loss to enhance the stability of the training process and improve the algorithm's performance. Experimental results on the VCTK + DEMAND benchmark demonstrate that the proposed method achieves competitive performance compared to the supervised methods. 
Furthermore, the speech recognition results on the CHiME4 benchmark show the superiority of the proposed method over its supervised counterpart.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4445-4455"},"PeriodicalIF":4.1,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142434621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}