{"title":"Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT","authors":"Kazuki Yamauchi, Yuki Saito, Hiroshi Saruwatari","doi":"arxiv-2409.07265","DOIUrl":"https://doi.org/arxiv-2409.07265","url":null,"abstract":"We explore cross-dialect text-to-speech (CD-TTS), a task to synthesize\u0000learned speakers' voices in non-native dialects, especially in pitch-accent\u0000languages. CD-TTS is important for developing voice agents that naturally\u0000communicate with people across regions. We present a novel TTS model comprising\u0000three sub-modules to perform competitively at this task. We first train a\u0000backbone TTS model to synthesize dialect speech from a text conditioned on\u0000phoneme-level accent latent variables (ALVs) extracted from speech by a\u0000reference encoder. Then, we train an ALV predictor to predict ALVs tailored to\u0000a target dialect from input text leveraging our novel multi-dialect\u0000phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the\u0000effectiveness of our model by comparing it with a baseline derived from\u0000conventional dialect TTS methods. The results show that our model improves the\u0000dialectal naturalness of synthetic speech in CD-TTS.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FlowSep: Language-Queried Sound Separation with Rectified Flow Matching","authors":"Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang","doi":"arxiv-2409.07614","DOIUrl":"https://doi.org/arxiv-2409.07614","url":null,"abstract":"Language-queried audio source separation (LASS) focuses on separating sounds\u0000using textual descriptions of the desired sources. Current methods mainly use\u0000discriminative approaches, such as time-frequency masking, to separate target\u0000sounds and minimize interference from other sources. However, these models face\u0000challenges when separating overlapping soundtracks, which may lead to artifacts\u0000such as spectral holes or incomplete separation. Rectified flow matching (RFM),\u0000a generative model that establishes linear relations between the distribution\u0000of data and noise, offers superior theoretical properties and simplicity, but\u0000has not yet been explored in sound separation. In this work, we introduce\u0000FlowSep, a new generative model based on RFM for LASS tasks. FlowSep learns\u0000linear flow trajectories from noise to target source features within the\u0000variational autoencoder (VAE) latent space. During inference, the RFM-generated\u0000latent features are reconstructed into a mel-spectrogram via the pre-trained\u0000VAE decoder, followed by a pre-trained vocoder to synthesize the waveform.\u0000Trained on 1,680 hours of audio data, FlowSep outperforms the state-of-the-art\u0000models across multiple benchmarks, as evaluated with subjective and objective\u0000metrics. Additionally, our results show that FlowSep surpasses a\u0000diffusion-based LASS model in both separation quality and inference efficiency,\u0000highlighting its strong potential for audio source separation tasks. Code,\u0000pre-trained models and demos can be found at:\u0000https://audio-agi.github.io/FlowSep_demo/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"460 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analytic Class Incremental Learning for Sound Source Localization with Privacy Protection","authors":"Xinyuan Qian, Xianghu Yue, Jiadong Wang, Huiping Zhuang, Haizhou Li","doi":"arxiv-2409.07224","DOIUrl":"https://doi.org/arxiv-2409.07224","url":null,"abstract":"Sound Source Localization (SSL) enabling technology for applications such as\u0000surveillance and robotics. While traditional Signal Processing (SP)-based SSL\u0000methods provide analytic solutions under specific signal and noise assumptions,\u0000recent Deep Learning (DL)-based methods have significantly outperformed them.\u0000However, their success depends on extensive training data and substantial\u0000computational resources. Moreover, they often rely on large-scale annotated\u0000spatial data and may struggle when adapting to evolving sound classes. To\u0000mitigate these challenges, we propose a novel Class Incremental Learning (CIL)\u0000approach, termed SSL-CIL, which avoids serious accuracy degradation due to\u0000catastrophic forgetting by incrementally updating the DL-based SSL model\u0000through a closed-form analytic solution. In particular, data privacy is ensured\u0000since the learning process does not revisit any historical data\u0000(exemplar-free), which is more suitable for smart home scenarios. Empirical\u0000results in the public SSLR dataset demonstrate the superior performance of our\u0000proposal, achieving a localization accuracy of 90.9%, surpassing other\u0000competitive methods.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural Ambisonic Encoding For Multi-Speaker Scenarios Using A Circular Microphone Array","authors":"Yue Qiao, Vinay Kothapally, Meng Yu, Dong Yu","doi":"arxiv-2409.06954","DOIUrl":"https://doi.org/arxiv-2409.06954","url":null,"abstract":"Spatial audio formats like Ambisonics are playback device layout-agnostic and\u0000well-suited for applications such as teleconferencing and virtual reality.\u0000Conventional Ambisonic encoding methods often rely on spherical microphone\u0000arrays for efficient sound field capture, which limits their flexibility in\u0000practical scenarios. We propose a deep learning (DL)-based approach, leveraging\u0000a two-stage network architecture for encoding circular microphone array signals\u0000into second-order Ambisonics (SOA) in multi-speaker environments. In addition,\u0000we introduce: (i) a novel loss function based on spatial power maps to\u0000regularize inter-channel correlations of the Ambisonic signals, and (ii) a\u0000channel permutation technique to resolve the ambiguity of encoding vertical\u0000information using a horizontal circular array. Evaluation on simulated speech\u0000and noise datasets shows that our approach consistently outperforms traditional\u0000signal processing (SP) and DL-based methods, providing significantly better\u0000timbral and spatial quality and higher source localization accuracy. Binaural\u0000audio demos with visualizations are available at\u0000https://bridgoon97.github.io/NeuralAmbisonicEncoding/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis","authors":"Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu","doi":"arxiv-2409.07556","DOIUrl":"https://doi.org/arxiv-2409.07556","url":null,"abstract":"In this paper, we introduce SSR-Speech, a neural codec autoregressive model\u0000designed for stable, safe, and robust zero-shot text-based speech editing and\u0000text-to-speech synthesis. SSR-Speech is built on a Transformer decoder and\u0000incorporates classifier-free guidance to enhance the stability of the\u0000generation process. A watermark Encodec is proposed to embed frame-level\u0000watermarks into the edited regions of the speech so that which parts were\u0000edited can be detected. In addition, the waveform reconstruction leverages the\u0000original unedited speech segments, providing superior recovery compared to the\u0000Encodec model. Our approach achieves the state-of-the-art performance in the\u0000RealEdit speech editing task and the LibriTTS text-to-speech task, surpassing\u0000previous methods. Furthermore, SSR-Speech excels in multi-span speech editing\u0000and also demonstrates remarkable robustness to background sounds. Source code\u0000and demos are released.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VoiceWukong: Benchmarking Deepfake Voice Detection","authors":"Ziwei Yan, Yanjie Zhao, Haoyu Wang","doi":"arxiv-2409.06348","DOIUrl":"https://doi.org/arxiv-2409.06348","url":null,"abstract":"With the rapid advancement of technologies like text-to-speech (TTS) and\u0000voice conversion (VC), detecting deepfake voices has become increasingly\u0000crucial. However, both academia and industry lack a comprehensive and intuitive\u0000benchmark for evaluating detectors. Existing datasets are limited in language\u0000diversity and lack many manipulations encountered in real-world production\u0000environments. To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate\u0000the performance of deepfake voice detectors. To build the dataset, we first\u0000collected deepfake voices generated by 19 advanced and widely recognized\u0000commercial tools and 15 open-source tools. We then created 38 data variants\u0000covering six types of manipulations, constructing the evaluation dataset for\u0000deepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200\u0000Chinese deepfake voice samples. Using VoiceWukong, we evaluated 12\u0000state-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of\u000013.50%, while all others exceeded 20%. Our findings reveal that these detectors\u0000face significant challenges in real-world applications, with dramatically\u0000declining performance. In addition, we conducted a user study with more than\u0000300 participants. The results are compared with the performance of the 12\u0000detectors and a multimodel large language model (MLLM), i.e., Qwen2-Audio,\u0000where different detectors and humans exhibit varying identification\u0000capabilities for deepfake voices at different deception levels, while the LALM\u0000demonstrates no detection ability at all. Furthermore, we provide a leaderboard\u0000for deepfake voice detection, publicly available at\u0000{https://voicewukong.github.io}.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself","authors":"Chang Zeng, Chunhui Wang, Xiaoxiao Miao, Jian Zhao, Zhonglin Jiang, Yong Chen","doi":"arxiv-2409.06330","DOIUrl":"https://doi.org/arxiv-2409.06330","url":null,"abstract":"It is challenging to accelerate the training process while ensuring both\u0000high-quality generated voices and acceptable inference speed. In this paper, we\u0000propose a novel neural vocoder called InstructSing, which can converge much\u0000faster compared with other neural vocoders while maintaining good performance\u0000by integrating differentiable digital signal processing and adversarial\u0000training. It includes one generator and two discriminators. Specifically, the\u0000generator incorporates a harmonic-plus-noise (HN) module to produce 8kHz audio\u0000as an instructive signal. Subsequently, the HN module is connected with an\u0000extended WaveNet by an UNet-based module, which transforms the output of the HN\u0000module to a latent variable sequence containing essential periodic and\u0000aperiodic information. In addition to the latent sequence, the extended WaveNet\u0000also takes the mel-spectrogram as input to generate 48kHz high-fidelity singing\u0000voices. In terms of discriminators, we combine a multi-period discriminator, as\u0000originally proposed in HiFiGAN, with a multi-resolution multi-band STFT\u0000discriminator. Notably, InstructSing achieves comparable voice quality to other\u0000neural vocoders but with only one-tenth of the training steps on a 4 NVIDIA\u0000V100 GPU machinefootnote{{Demo page:\u0000href{https://wavelandspeech.github.io/instructsing/}{texttt{https://wavelandspeech.github.io/instructsing/}}}}.\u0000We plan to open-source our code and pretrained model once the paper get\u0000accepted.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches","authors":"Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi","doi":"arxiv-2409.06327","DOIUrl":"https://doi.org/arxiv-2409.06327","url":null,"abstract":"In real-world applications, it is challenging to build a speaker verification\u0000system that is simultaneously robust against common threats, including spoofing\u0000attacks, channel mismatch, and domain mismatch. Traditional automatic speaker\u0000verification (ASV) systems often tackle these issues separately, leading to\u0000suboptimal performance when faced with simultaneous challenges. In this paper,\u0000we propose an integrated framework that incorporates pair-wise learning and\u0000spoofing attack simulation into the meta-learning paradigm to enhance\u0000robustness against these multifaceted threats. This novel approach employs an\u0000asymmetric dual-path model and a multi-task learning strategy to handle ASV,\u0000anti-spoofing, and spoofing-aware ASV tasks concurrently. A new testing\u0000dataset, CNComplex, is introduced to evaluate system performance under these\u0000combined threats. Experimental results demonstrate that our integrated model\u0000significantly improves performance over traditional ASV systems across various\u0000scenarios, showcasing its potential for real-world deployment. Additionally,\u0000the proposed framework's ability to generalize across different conditions\u0000highlights its robustness and reliability, making it a promising solution for\u0000practical ASV applications.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models","authors":"Arvind Krishna Sridhar, Yinyi Guo, Erik Visser","doi":"arxiv-2409.06223","DOIUrl":"https://doi.org/arxiv-2409.06223","url":null,"abstract":"The Audio Question Answering task includes audio event classification, audio\u0000captioning, and open ended reasoning. Recently, Audio Question Answering has\u0000garnered attention due to the advent of Large Audio Language Models. Current\u0000literature focuses on constructing LALMs by integrating audio encoders with\u0000text only Large Language Models through a projection module. While Large Audio\u0000Language Models excel in general audio understanding, they are limited in\u0000temporal reasoning which may hinder their commercial applications and on device\u0000deployment. This paper addresses these challenges and limitations in audio\u0000temporal reasoning. First, we introduce a data augmentation technique for\u0000generating reliable audio temporal questions and answers using an LLM. Second,\u0000we propose a continued finetuning curriculum learning strategy to specialize in\u0000temporal reasoning without compromising performance on finetuned tasks.\u0000Finally, we develop a reliable and transparent automated metric, assisted by an\u0000LLM, to measure the correlation between Large Audio Language Model responses\u0000and ground truth data intelligently. We demonstrate the effectiveness of our\u0000proposed techniques using SOTA LALMs on public audio benchmark datasets.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders","authors":"Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw","doi":"arxiv-2409.06635","DOIUrl":"https://doi.org/arxiv-2409.06635","url":null,"abstract":"The rapid advancements in large language models (LLMs) have significantly\u0000enhanced natural language processing capabilities, facilitating the development\u0000of AudioLLMs that process and understand speech and audio inputs alongside\u0000text. Existing AudioLLMs typically combine a pre-trained audio encoder with a\u0000pre-trained LLM, which are subsequently finetuned on specific audio tasks.\u0000However, the pre-trained audio encoder has constrained capacity to capture\u0000features for new tasks and datasets. To address this, we propose to incorporate\u0000mixtures of `weak' encoders (MoWE) into the AudioLLM framework. MoWE\u0000supplements a base encoder with a pool of relatively light weight encoders,\u0000selectively activated based on the audio input to enhance feature extraction\u0000without significantly increasing model size. Our empirical results demonstrate\u0000that MoWE effectively improves multi-task performance, broadening the\u0000applicability of AudioLLMs to more diverse audio tasks.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}