{"title":"Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT","authors":"Kazuki Yamauchi, Yuki Saito, Hiroshi Saruwatari","doi":"arxiv-2409.07265","DOIUrl":"https://doi.org/arxiv-2409.07265","url":null,"abstract":"We explore cross-dialect text-to-speech (CD-TTS), a task to synthesize\u0000learned speakers' voices in non-native dialects, especially in pitch-accent\u0000languages. CD-TTS is important for developing voice agents that naturally\u0000communicate with people across regions. We present a novel TTS model comprising\u0000three sub-modules to perform competitively at this task. We first train a\u0000backbone TTS model to synthesize dialect speech from a text conditioned on\u0000phoneme-level accent latent variables (ALVs) extracted from speech by a\u0000reference encoder. Then, we train an ALV predictor to predict ALVs tailored to\u0000a target dialect from input text leveraging our novel multi-dialect\u0000phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the\u0000effectiveness of our model by comparing it with a baseline derived from\u0000conventional dialect TTS methods. The results show that our model improves the\u0000dialectal naturalness of synthetic speech in CD-TTS.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FlowSep: Language-Queried Sound Separation with Rectified Flow Matching","authors":"Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang","doi":"arxiv-2409.07614","DOIUrl":"https://doi.org/arxiv-2409.07614","url":null,"abstract":"Language-queried audio source separation (LASS) focuses on separating sounds\u0000using textual descriptions of the desired sources. Current methods mainly use\u0000discriminative approaches, such as time-frequency masking, to separate target\u0000sounds and minimize interference from other sources. However, these models face\u0000challenges when separating overlapping soundtracks, which may lead to artifacts\u0000such as spectral holes or incomplete separation. Rectified flow matching (RFM),\u0000a generative model that establishes linear relations between the distribution\u0000of data and noise, offers superior theoretical properties and simplicity, but\u0000has not yet been explored in sound separation. In this work, we introduce\u0000FlowSep, a new generative model based on RFM for LASS tasks. FlowSep learns\u0000linear flow trajectories from noise to target source features within the\u0000variational autoencoder (VAE) latent space. During inference, the RFM-generated\u0000latent features are reconstructed into a mel-spectrogram via the pre-trained\u0000VAE decoder, followed by a pre-trained vocoder to synthesize the waveform.\u0000Trained on 1,680 hours of audio data, FlowSep outperforms the state-of-the-art\u0000models across multiple benchmarks, as evaluated with subjective and objective\u0000metrics. Additionally, our results show that FlowSep surpasses a\u0000diffusion-based LASS model in both separation quality and inference efficiency,\u0000highlighting its strong potential for audio source separation tasks. Code,\u0000pre-trained models and demos can be found at:\u0000https://audio-agi.github.io/FlowSep_demo/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"460 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analytic Class Incremental Learning for Sound Source Localization with Privacy Protection","authors":"Xinyuan Qian, Xianghu Yue, Jiadong Wang, Huiping Zhuang, Haizhou Li","doi":"arxiv-2409.07224","DOIUrl":"https://doi.org/arxiv-2409.07224","url":null,"abstract":"Sound Source Localization (SSL) enabling technology for applications such as\u0000surveillance and robotics. While traditional Signal Processing (SP)-based SSL\u0000methods provide analytic solutions under specific signal and noise assumptions,\u0000recent Deep Learning (DL)-based methods have significantly outperformed them.\u0000However, their success depends on extensive training data and substantial\u0000computational resources. Moreover, they often rely on large-scale annotated\u0000spatial data and may struggle when adapting to evolving sound classes. To\u0000mitigate these challenges, we propose a novel Class Incremental Learning (CIL)\u0000approach, termed SSL-CIL, which avoids serious accuracy degradation due to\u0000catastrophic forgetting by incrementally updating the DL-based SSL model\u0000through a closed-form analytic solution. In particular, data privacy is ensured\u0000since the learning process does not revisit any historical data\u0000(exemplar-free), which is more suitable for smart home scenarios. Empirical\u0000results in the public SSLR dataset demonstrate the superior performance of our\u0000proposal, achieving a localization accuracy of 90.9%, surpassing other\u0000competitive methods.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural Ambisonic Encoding For Multi-Speaker Scenarios Using A Circular Microphone Array","authors":"Yue Qiao, Vinay Kothapally, Meng Yu, Dong Yu","doi":"arxiv-2409.06954","DOIUrl":"https://doi.org/arxiv-2409.06954","url":null,"abstract":"Spatial audio formats like Ambisonics are playback device layout-agnostic and\u0000well-suited for applications such as teleconferencing and virtual reality.\u0000Conventional Ambisonic encoding methods often rely on spherical microphone\u0000arrays for efficient sound field capture, which limits their flexibility in\u0000practical scenarios. We propose a deep learning (DL)-based approach, leveraging\u0000a two-stage network architecture for encoding circular microphone array signals\u0000into second-order Ambisonics (SOA) in multi-speaker environments. In addition,\u0000we introduce: (i) a novel loss function based on spatial power maps to\u0000regularize inter-channel correlations of the Ambisonic signals, and (ii) a\u0000channel permutation technique to resolve the ambiguity of encoding vertical\u0000information using a horizontal circular array. Evaluation on simulated speech\u0000and noise datasets shows that our approach consistently outperforms traditional\u0000signal processing (SP) and DL-based methods, providing significantly better\u0000timbral and spatial quality and higher source localization accuracy. Binaural\u0000audio demos with visualizations are available at\u0000https://bridgoon97.github.io/NeuralAmbisonicEncoding/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis","authors":"Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu","doi":"arxiv-2409.07556","DOIUrl":"https://doi.org/arxiv-2409.07556","url":null,"abstract":"In this paper, we introduce SSR-Speech, a neural codec autoregressive model\u0000designed for stable, safe, and robust zero-shot text-based speech editing and\u0000text-to-speech synthesis. SSR-Speech is built on a Transformer decoder and\u0000incorporates classifier-free guidance to enhance the stability of the\u0000generation process. A watermark Encodec is proposed to embed frame-level\u0000watermarks into the edited regions of the speech so that which parts were\u0000edited can be detected. In addition, the waveform reconstruction leverages the\u0000original unedited speech segments, providing superior recovery compared to the\u0000Encodec model. Our approach achieves the state-of-the-art performance in the\u0000RealEdit speech editing task and the LibriTTS text-to-speech task, surpassing\u0000previous methods. Furthermore, SSR-Speech excels in multi-span speech editing\u0000and also demonstrates remarkable robustness to background sounds. Source code\u0000and demos are released.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VoiceWukong: Benchmarking Deepfake Voice Detection","authors":"Ziwei Yan, Yanjie Zhao, Haoyu Wang","doi":"arxiv-2409.06348","DOIUrl":"https://doi.org/arxiv-2409.06348","url":null,"abstract":"With the rapid advancement of technologies like text-to-speech (TTS) and\u0000voice conversion (VC), detecting deepfake voices has become increasingly\u0000crucial. However, both academia and industry lack a comprehensive and intuitive\u0000benchmark for evaluating detectors. Existing datasets are limited in language\u0000diversity and lack many manipulations encountered in real-world production\u0000environments. To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate\u0000the performance of deepfake voice detectors. To build the dataset, we first\u0000collected deepfake voices generated by 19 advanced and widely recognized\u0000commercial tools and 15 open-source tools. We then created 38 data variants\u0000covering six types of manipulations, constructing the evaluation dataset for\u0000deepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200\u0000Chinese deepfake voice samples. Using VoiceWukong, we evaluated 12\u0000state-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of\u000013.50%, while all others exceeded 20%. Our findings reveal that these detectors\u0000face significant challenges in real-world applications, with dramatically\u0000declining performance. In addition, we conducted a user study with more than\u0000300 participants. The results are compared with the performance of the 12\u0000detectors and a multimodel large language model (MLLM), i.e., Qwen2-Audio,\u0000where different detectors and humans exhibit varying identification\u0000capabilities for deepfake voices at different deception levels, while the LALM\u0000demonstrates no detection ability at all. Furthermore, we provide a leaderboard\u0000for deepfake voice detection, publicly available at\u0000{https://voicewukong.github.io}.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself","authors":"Chang Zeng, Chunhui Wang, Xiaoxiao Miao, Jian Zhao, Zhonglin Jiang, Yong Chen","doi":"arxiv-2409.06330","DOIUrl":"https://doi.org/arxiv-2409.06330","url":null,"abstract":"It is challenging to accelerate the training process while ensuring both\u0000high-quality generated voices and acceptable inference speed. In this paper, we\u0000propose a novel neural vocoder called InstructSing, which can converge much\u0000faster compared with other neural vocoders while maintaining good performance\u0000by integrating differentiable digital signal processing and adversarial\u0000training. It includes one generator and two discriminators. Specifically, the\u0000generator incorporates a harmonic-plus-noise (HN) module to produce 8kHz audio\u0000as an instructive signal. Subsequently, the HN module is connected with an\u0000extended WaveNet by an UNet-based module, which transforms the output of the HN\u0000module to a latent variable sequence containing essential periodic and\u0000aperiodic information. In addition to the latent sequence, the extended WaveNet\u0000also takes the mel-spectrogram as input to generate 48kHz high-fidelity singing\u0000voices. In terms of discriminators, we combine a multi-period discriminator, as\u0000originally proposed in HiFiGAN, with a multi-resolution multi-band STFT\u0000discriminator. Notably, InstructSing achieves comparable voice quality to other\u0000neural vocoders but with only one-tenth of the training steps on a 4 NVIDIA\u0000V100 GPU machinefootnote{{Demo page:\u0000href{https://wavelandspeech.github.io/instructsing/}{texttt{https://wavelandspeech.github.io/instructsing/}}}}.\u0000We plan to open-source our code and pretrained model once the paper get\u0000accepted.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches","authors":"Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi","doi":"arxiv-2409.06327","DOIUrl":"https://doi.org/arxiv-2409.06327","url":null,"abstract":"In real-world applications, it is challenging to build a speaker verification\u0000system that is simultaneously robust against common threats, including spoofing\u0000attacks, channel mismatch, and domain mismatch. Traditional automatic speaker\u0000verification (ASV) systems often tackle these issues separately, leading to\u0000suboptimal performance when faced with simultaneous challenges. In this paper,\u0000we propose an integrated framework that incorporates pair-wise learning and\u0000spoofing attack simulation into the meta-learning paradigm to enhance\u0000robustness against these multifaceted threats. This novel approach employs an\u0000asymmetric dual-path model and a multi-task learning strategy to handle ASV,\u0000anti-spoofing, and spoofing-aware ASV tasks concurrently. A new testing\u0000dataset, CNComplex, is introduced to evaluate system performance under these\u0000combined threats. Experimental results demonstrate that our integrated model\u0000significantly improves performance over traditional ASV systems across various\u0000scenarios, showcasing its potential for real-world deployment. Additionally,\u0000the proposed framework's ability to generalize across different conditions\u0000highlights its robustness and reliability, making it a promising solution for\u0000practical ASV applications.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models","authors":"Arvind Krishna Sridhar, Yinyi Guo, Erik Visser","doi":"arxiv-2409.06223","DOIUrl":"https://doi.org/arxiv-2409.06223","url":null,"abstract":"The Audio Question Answering task includes audio event classification, audio\u0000captioning, and open ended reasoning. Recently, Audio Question Answering has\u0000garnered attention due to the advent of Large Audio Language Models. Current\u0000literature focuses on constructing LALMs by integrating audio encoders with\u0000text only Large Language Models through a projection module. While Large Audio\u0000Language Models excel in general audio understanding, they are limited in\u0000temporal reasoning which may hinder their commercial applications and on device\u0000deployment. This paper addresses these challenges and limitations in audio\u0000temporal reasoning. First, we introduce a data augmentation technique for\u0000generating reliable audio temporal questions and answers using an LLM. Second,\u0000we propose a continued finetuning curriculum learning strategy to specialize in\u0000temporal reasoning without compromising performance on finetuned tasks.\u0000Finally, we develop a reliable and transparent automated metric, assisted by an\u0000LLM, to measure the correlation between Large Audio Language Model responses\u0000and ground truth data intelligently. We demonstrate the effectiveness of our\u0000proposed techniques using SOTA LALMs on public audio benchmark datasets.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders","authors":"Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw","doi":"arxiv-2409.06635","DOIUrl":"https://doi.org/arxiv-2409.06635","url":null,"abstract":"The rapid advancements in large language models (LLMs) have significantly\u0000enhanced natural language processing capabilities, facilitating the development\u0000of AudioLLMs that process and understand speech and audio inputs alongside\u0000text. Existing AudioLLMs typically combine a pre-trained audio encoder with a\u0000pre-trained LLM, which are subsequently finetuned on specific audio tasks.\u0000However, the pre-trained audio encoder has constrained capacity to capture\u0000features for new tasks and datasets. To address this, we propose to incorporate\u0000mixtures of `weak' encoders (MoWE) into the AudioLLM framework. MoWE\u0000supplements a base encoder with a pool of relatively light weight encoders,\u0000selectively activated based on the audio input to enhance feature extraction\u0000without significantly increasing model size. Our empirical results demonstrate\u0000that MoWE effectively improves multi-task performance, broadening the\u0000applicability of AudioLLMs to more diverse audio tasks.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142217884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}