{"title":"Spherically Steerable Vector Differential Microphone Arrays","authors":"Hüseyin Hacıhabiboğlu","doi":"10.1109/TASLP.2024.3458799","DOIUrl":"10.1109/TASLP.2024.3458799","url":null,"abstract":"Differential microphone arrays (DMAs) use multiple omnidirectional microphones for synthesising higher-order microphone directivity patterns. In their most basic form, they can be used to obtain fixed-directivity or horizontally steerable beamformers that can satisfy certain constraints. We propose a vector differential microphone array (VDMA) which is frequency- and direction-invariantly steerable in three dimensions. The proposed design comprises pressure and particle velocity sensors positioned on a circular constellation in a plane and allows extracting the third-order spherical harmonic decomposition of the sound field. This decomposition can then be used to obtain spherically direction-invariant steered beams. Synthesis of a maximum directivity factor (MaxDF) directivity pattern is demonstrated. A closed-form expression for the proposed array's white noise gain (WNG) is derived. The robustness of the proposed design to noise is analysed.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4342-4354"},"PeriodicalIF":4.1,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-Supervised Learning of Spatial Acoustic Representation With Cross-Channel Signal Reconstruction and Multi-Channel Conformer","authors":"Bing Yang;Xiaofei Li","doi":"10.1109/TASLP.2024.3458811","DOIUrl":"10.1109/TASLP.2024.3458811","url":null,"abstract":"Supervised learning methods have shown effectiveness in estimating spatial acoustic parameters such as time difference of arrival, direct-to-reverberant ratio and reverberation time. However, they still suffer from the simulation-to-reality generalization problem due to the mismatch between simulated and real-world acoustic characteristics and the deficiency of annotated real-world data. To this end, this work proposes a self-supervised method that takes full advantage of unlabeled data for spatial acoustic parameter estimation. First, a new pretext task, i.e. cross-channel signal reconstruction (CCSR), is designed to learn a universal spatial acoustic representation from unlabeled multi-channel microphone signals. We mask partial signals of one channel and ask the model to reconstruct them, which makes it possible to learn spatial acoustic information from unmasked signals and extract source information from the other microphone channel. An encoder-decoder structure is used to disentangle the two kinds of information. By fine-tuning the pre-trained spatial encoder with a small annotated dataset, this encoder can be used to estimate spatial acoustic parameters. Second, a novel multi-channel audio Conformer (MC-Conformer) is adopted as the encoder model architecture, which is suitable for both the pretext and downstream tasks. It is carefully designed to be able to capture the local and global characteristics of spatial acoustics exhibited in the time-frequency domain. Experimental results of five acoustic parameter estimation tasks on both simulated and real-world data show the effectiveness of the proposed method. To the best of our knowledge, this is the first self-supervised learning method in the field of spatial acoustic representation learning and multi-channel audio signal processing.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4211-4225"},"PeriodicalIF":4.1,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ZMM-TTS: Zero-Shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-Supervised Discrete Speech Representations","authors":"Cheng Gong;Xin Wang;Erica Cooper;Dan Wells;Longbiao Wang;Jianwu Dang;Korin Richmond;Junichi Yamagishi","doi":"10.1109/TASLP.2024.3451951","DOIUrl":"10.1109/TASLP.2024.3451951","url":null,"abstract":"Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality audio data. TTS systems are typically built using a single speaker's voice, but there is growing interest in developing systems that can synthesize voices for new speakers using only a few seconds of their speech. This paper presents ZMM-TTS, a multilingual and multispeaker framework utilizing quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our paper combines text-based and speech-based self-supervised learning models for multilingual speech synthesis. Our proposed model has zero-shot generalization ability not only for unseen speakers but also for unseen languages. We have conducted comprehensive subjective and objective evaluations through a series of experiments. Our model has proven effective in terms of speech naturalness and similarity for both seen and unseen speakers in six high-resource languages. We also tested the efficiency of our method on two hypothetically low-resource languages. The results are promising, indicating that our proposed approach can synthesize audio that is intelligible and has a high degree of similarity to the target speaker's voice, even without any training data for the new, unseen language.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4036-4051"},"PeriodicalIF":4.1,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10669054","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"U-Style: Cascading U-Nets With Multi-Level Speaker and Style Modeling for Zero-Shot Voice Cloning","authors":"Tao Li;Zhichao Wang;Xinfa Zhu;Jian Cong;Qiao Tian;Yuping Wang;Lei Xie","doi":"10.1109/TASLP.2024.3453606","DOIUrl":"10.1109/TASLP.2024.3453606","url":null,"abstract":"Zero-shot speaker cloning aims to synthesize speech for any target speaker unseen during TTS system building, given only a single speech reference of the speaker at hand. Although more practical in real applications, the current zero-shot methods still produce speech with undesirable naturalness and speaker similarity. Moreover, endowing the target speaker with arbitrary speaking styles in the zero-shot setup has not been considered. This is because the unique challenge of \u0000<italic>zero-shot speaker and style cloning</i>\u0000 is to learn the disentangled speaker and style representations from only short references representing an arbitrary speaker and an arbitrary style. To address this challenge, we propose \u0000<italic>U-Style</i>\u0000, which employs Grad-TTS as the backbone, particularly cascading a \u0000<italic>speaker-specific encoder</i>\u0000 and a \u0000<italic>style-specific encoder</i>\u0000 between the text encoder and the diffusion decoder. Thus, leveraging signal perturbation, U-Style is explicitly decomposed into speaker- and style-specific modeling parts, achieving better speaker and style disentanglement. To improve unseen speaker and style modeling ability, these two encoders conduct multi-level speaker and style modeling by skip-connected U-nets, incorporating the representation extraction and information reconstruction process. Besides, to improve the naturalness of synthetic speech, we adopt mean-based instance normalization and style adaptive layer normalization in these encoders to perform representation extraction and condition adaptation, respectively. Experiments show that U-Style significantly surpasses the state-of-the-art methods in unseen speaker cloning regarding naturalness and speaker similarity. Notably, U-Style can transfer the style from an unseen source speaker to another unseen target speaker, achieving flexible combinations of desired speaker timbre and style in zero-shot voice cloning.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4026-4035"},"PeriodicalIF":4.1,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Blind Identification of Binaural Room Impulse Responses From Smart Glasses","authors":"Thomas Deppisch;Nils Meyer-Kahlen;Sebastià V. Amengual Garí","doi":"10.1109/TASLP.2024.3454964","DOIUrl":"10.1109/TASLP.2024.3454964","url":null,"abstract":"Smart glasses are increasingly recognized as a key medium for augmented reality, offering a hands-free platform with integrated microphones and non-ear-occluding loudspeakers to seamlessly mix virtual sound sources into the real-world acoustic scene. To convincingly integrate virtual sound sources, the room acoustic rendering of the virtual sources must match the real-world acoustics. Information about a user's acoustic environment however is typically not available. This work uses a microphone array in a pair of smart glasses to blindly identify binaural room impulse responses (BRIRs) from a few seconds of speech in the real-world environment. The proposed method uses dereverberation and beamforming to generate a pseudo reference signal that is used by a multichannel Wiener filter to estimate room impulse responses which are then converted to BRIRs. The multichannel room impulse responses can be used to estimate room acoustic parameters which is shown to outperform baseline algorithms in the estimation of reverberation time and direct-to-reverberant energy ratio. Results from a listening experiment further indicate that the estimated BRIRs often reproduce the real-world room acoustics perceptually more convincingly than measured BRIRs from other rooms of similar size.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4052-4065"},"PeriodicalIF":4.1,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Refining Synthesized Speech Using Speaker Information and Phone Masking for Data Augmentation of Speech Recognition","authors":"Sei Ueno;Akinobu Lee;Tatsuya Kawahara","doi":"10.1109/TASLP.2024.3451982","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3451982","url":null,"abstract":"While end-to-end automatic speech recognition (ASR) has shown impressive performance, it requires a huge amount of speech and transcription data. The conversion of domain-matched text to speech (TTS) has been investigated as one approach to data augmentation. The quality and diversity of the synthesized speech are critical in this approach. To ensure quality, a neural vocoder is widely used to generate speech waveforms in conventional studies, but it requires a huge amount of computation and another conversion to spectral-domain features such as the log-Mel filterbank (lmfb) output typically used for ASR. In this study, we explore the direct refinement of these features. Unlike conventional speech enhancement, we can use information on the ground-truth phone sequences of the speech and designated speaker to improve the quality and diversity. This process is realized as a Mel-to-Mel network, which can be placed after a text-to-Mel synthesis system such as FastSpeech 2. These two networks can be trained jointly. Moreover, semantic masking is applied to the lmfb features for robust training. Experimental evaluations demonstrate the effect of phone information, speaker information, and semantic masking. For speaker information, x-vector performs better than the simple speaker embedding. The proposed method achieves even better ASR performance with a much shorter computation time than the conventional method using a vocoder.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3924-3933"},"PeriodicalIF":4.1,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142143736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sound Activity-Aware Based Cross-Task Collaborative Training for Semi-Supervised Sound Event Detection","authors":"Yadong Guan;Jiqing Han;Hongwei Song;Shiwen Deng;Guibin Zheng;Tieran Zheng;Yongjun He","doi":"10.1109/TASLP.2024.3451983","DOIUrl":"10.1109/TASLP.2024.3451983","url":null,"abstract":"The training of sound event detection (SED) models remains a challenge of insufficient supervision due to limited frame-wise labeled data. Mainstream research on this problem has adopted semi-supervised training strategies that generate pseudo-labels for unlabeled data and use these data for the training of a model. Recent works further introduce multi-task training strategies to impose additional supervision. However, the auxiliary tasks employed in these methods either lack frame-wise guidance or exhibit unsuitable task designs. Furthermore, they fail to exploit inter-task relationships effectively, which can serve as valuable supervision. In this paper, we introduce a novel task, sound occurrence and overlap detection (SOD), which detects predefined sound activity patterns, including non-overlapping and overlapping cases. On the basis of SOD, we propose a cross-task collaborative training framework that leverages the relationship between SED and SOD to improve the SED model. Firstly, by jointly optimizing the two tasks in a multi-task manner, the SED model is encouraged to learn features sensitive to sound activity. Subsequently, the cross-task consistency regularization is proposed to promote consistent predictions between SED and SOD. Finally, we propose a pseudo-label selection method that uses inconsistent predictions between the two tasks to identify potential wrong pseudo-labels and mitigate their confirmation bias. In the inference phase, only the trained SED model is used, thus no additional computation and storage costs are incurred. Extensive experiments on the DESED dataset demonstrate the effectiveness of our method.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3947-3959"},"PeriodicalIF":4.1,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Selective-Memory Meta-Learning With Environment Representations for Sound Event Localization and Detection","authors":"Jinbo Hu;Yin Cao;Ming Wu;Qiuqiang Kong;Feiran Yang;Mark D. Plumbley;Jun Yang","doi":"10.1109/TASLP.2024.3451974","DOIUrl":"10.1109/TASLP.2024.3451974","url":null,"abstract":"Environment shifts and conflicts present significant challenges for learning-based sound event localization and detection (SELD) methods. SELD systems, when trained in particular acoustic settings, often show restricted generalization capabilities for diverse acoustic environments. Furthermore, obtaining annotated samples for spatial sound events is notably costly. Deploying a SELD system in a new environment requires extensive time for re-training and fine-tuning. To overcome these challenges, we propose environment-adaptive Meta-SELD, designed for efficient adaptation to new environments using minimal data. Our method specifically utilizes computationally synthesized spatial data and employs Model-Agnostic Meta-Learning (MAML) on a pre-trained, environment-independent model. The method then utilizes fast adaptation to unseen real-world environments using limited samples from the respective environments. Inspired by the Learning-to-Forget approach, we introduce the concept of selective memory as a strategy for resolving conflicts across environments. This approach involves selectively memorizing target-environment-relevant information and adapting to the new environments through the selective attenuation of model parameters. In addition, we introduce environment representations to characterize different acoustic settings, enhancing the adaptability of our attenuation approach to various environments. We evaluate our proposed method on the development set of the Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset and computationally synthesized scenes. Experimental results demonstrate the superior performance of the proposed method compared to conventional supervised learning methods, particularly in localization.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4313-4327"},"PeriodicalIF":4.1,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Binaural Beamforming Taking Into Account Spatial Release From Masking","authors":"Johannes W. de Vries;Steven van de Par;Geert Leus;Richard Heusdens;Richard C. Hendriks","doi":"10.1109/TASLP.2024.3451988","DOIUrl":"10.1109/TASLP.2024.3451988","url":null,"abstract":"Hearing impairment is a prevalent problem with daily challenges like impaired speech intelligibility and sound localisation. One of the shortcomings of spatial filtering in hearing aids is that speech intelligibility is often not optimised directly, meaning that different auditory processes contributing to intelligibility are often not considered. One example is the perceptual phenomenon known as spatial release from masking (SRM). This paper develops a signal model that explicitly considers SRM in the beamforming design, achieved by transforming the binaural intelligibility prediction model (BSIM) into a signal processing framework. The resulting extended signal model is used to analyse the performance of reference beamformers and design a novel beamformer that more closely considers how the auditory system perceives binaural sound. It can be shown that the binaural minimum variance distortionless response (BMVDR) beamformer is also an optimal solution for the extended, perceived model, suggesting that SRM does not play a significant role in intelligibility enhancement after optimal beamforming. However, the optimal beamformer is no longer unique in the extended signal model. The additional secondary degrees of freedom can be used to preserve binaural cues of interfering sources while still achieving the same perceived performance of the BMVDR beamformer, though with a possible high sensitivity to intelligibility model mismatch errors.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4002-4012"},"PeriodicalIF":4.1,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RefXVC: Cross-Lingual Voice Conversion With Enhanced Reference Leveraging","authors":"Mingyang Zhang;Yi Zhou;Yi Ren;Chen Zhang;Xiang Yin;Haizhou Li","doi":"10.1109/TASLP.2024.3439996","DOIUrl":"10.1109/TASLP.2024.3439996","url":null,"abstract":"This paper proposes RefXVC, a method for cross-lingual voice conversion (XVC) that leverages reference information to improve conversion performance. Previous XVC works generally take an average speaker embedding to condition the speaker identity, which does not account for the changing timbre of speech that occurs with different pronunciations. To address this, our method uses both global and local speaker embeddings to capture the timbre changes during speech conversion. Additionally, we observed a connection between timbre and pronunciation in different languages and utilized this by incorporating a timbre encoder and a pronunciation matching network into our model. Furthermore, we found that the variation in tones is not adequately reflected in a sentence, and therefore, we used multiple references to better capture the range of a speaker's voice. The proposed method outperformed existing systems in terms of both speech quality and speaker similarity, highlighting the effectiveness of leveraging reference information in cross-lingual voice conversion.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4146-4156"},"PeriodicalIF":4.1,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}