{"title":"Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data","authors":"Jing Xu, Daxin Tan, Jiaqi Wang, Xiao Chen","doi":"arxiv-2409.10969","DOIUrl":"https://doi.org/arxiv-2409.10969","url":null,"abstract":"While large language models (LLMs) have been explored in the speech domain\u0000for both generation and recognition tasks, their applications are predominantly\u0000confined to the monolingual scenario, with limited exploration in multilingual\u0000and code-switched (CS) contexts. Additionally, speech generation and\u0000recognition tasks are often handled separately, such as VALL-E and Qwen-Audio.\u0000In this paper, we propose a MutltiLingual MultiTask (MLMT) model, integrating\u0000multilingual speech generation and recognition tasks within the single LLM.\u0000Furthermore, we develop an effective data construction approach that splits and\u0000concatenates words from different languages to equip LLMs with CS synthesis\u0000ability without relying on CS data. The experimental results demonstrate that\u0000our model outperforms other baselines with a comparable data scale.\u0000Furthermore, our data construction approach not only equips LLMs with CS speech\u0000synthesis capability with comparable speaker consistency and similarity to any\u0000given speaker, but also improves the performance of LLMs in multilingual speech\u0000generation and recognition tasks.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"3DFacePolicy: Speech-Driven 3D Facial Animation with Diffusion Policy","authors":"Xuanmeng Sha, Liyun Zhang, Tomohiro Mashita, Yuki Uranishi","doi":"arxiv-2409.10848","DOIUrl":"https://doi.org/arxiv-2409.10848","url":null,"abstract":"Audio-driven 3D facial animation has made immersive progress both in research\u0000and application developments. The newest approaches focus on Transformer-based\u0000methods and diffusion-based methods, however, there is still gap in the\u0000vividness and emotional expression between the generated animation and real\u0000human face. To tackle this limitation, we propose 3DFacePolicy, a diffusion\u0000policy model for 3D facial animation prediction. This method generates variable\u0000and realistic human facial movements by predicting the 3D vertex trajectory on\u0000the 3D facial template with diffusion policy instead of facial generation for\u0000every frame. It takes audio and vertex states as observations to predict the\u0000vertex trajectory and imitate real human facial expressions, which keeps the\u0000continuous and natural flow of human emotions. The experiments show that our\u0000approach is effective in variable and dynamic facial motion synthesizing.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spontaneous Informal Speech Dataset for Punctuation Restoration","authors":"Xing Yi Liu, Homayoon Beigi","doi":"arxiv-2409.11241","DOIUrl":"https://doi.org/arxiv-2409.11241","url":null,"abstract":"Presently, punctuation restoration models are evaluated almost solely on\u0000well-structured, scripted corpora. On the other hand, real-world ASR systems\u0000and post-processing pipelines typically apply towards spontaneous speech with\u0000significant irregularities, stutters, and deviations from perfect grammar. To\u0000address this discrepancy, we introduce SponSpeech, a punctuation restoration\u0000dataset derived from informal speech sources, which includes punctuation and\u0000casing information. In addition to publicly releasing the dataset, we\u0000contribute a filtering pipeline that can be used to generate more data. Our\u0000filtering pipeline examines the quality of both speech audio and transcription\u0000text. We also carefully construct a ``challenging\" test set, aimed at\u0000evaluating models' ability to leverage audio information to predict otherwise\u0000grammatically ambiguous punctuation. SponSpeech is available at\u0000https://github.com/GitHubAccountAnonymous/PR, along with all code for dataset\u0000building and model runs.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Room impulse response prototyping using receiver distance estimations for high quality room equalisation algorithms","authors":"James Brooks-Park, Martin Bo Møller, Jan Østergaard, Søren Bech, Steven van de Par","doi":"arxiv-2409.10131","DOIUrl":"https://doi.org/arxiv-2409.10131","url":null,"abstract":"Room equalisation aims to increase the quality of loudspeaker reproduction in\u0000reverberant environments, compensating for colouration caused by imperfect room\u0000reflections and frequency dependant loudspeaker directivity. A common technique\u0000in the field of room equalisation, is to invert a prototype Room Impulse\u0000Response (RIR). Rather than inverting a single RIR at the listening position, a\u0000prototype response is composed of several responses distributed around the\u0000listening area. This paper proposes a method of impulse response prototyping,\u0000using estimated receiver positions, to form a weighted average prototype\u0000response. A method of receiver distance estimation is described, supporting the\u0000implementation of the prototype RIR. The proposed prototyping method is\u0000compared to other methods by measuring their post equalisation spectral\u0000deviation at several positions in a simulated room.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265596","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement","authors":"Wenze Ren, Haibin Wu, Yi-Cheng Lin, Xuanjun Chen, Rong Chao, Kuo-Hsuan Hung, You-Jin Li, Wen-Yuan Ting, Hsin-Min Wang, Yu Tsao","doi":"arxiv-2409.10376","DOIUrl":"https://doi.org/arxiv-2409.10376","url":null,"abstract":"In multichannel speech enhancement, effectively capturing spatial and\u0000spectral information across different microphones is crucial for noise\u0000reduction. Traditional methods, such as CNN or LSTM, attempt to model the\u0000temporal dynamics of full-band and sub-band spectral and spatial features.\u0000However, these approaches face limitations in fully modeling complex temporal\u0000dependencies, especially in dynamic acoustic environments. To overcome these\u0000challenges, we modify the current advanced model McNet by introducing an\u0000improved version of Mamba, a state-space model, and further propose MCMamba.\u0000MCMamba has been completely reengineered to integrate full-band and narrow-band\u0000spatial information with sub-band and full-band spectral features, providing a\u0000more comprehensive approach to modeling spatial and spectral information. Our\u0000experimental results demonstrate that MCMamba significantly improves the\u0000modeling of spatial and spectral features in multichannel speech enhancement,\u0000outperforming McNet and achieving state-of-the-art performance on the CHiME-3\u0000dataset. Additionally, we find that Mamba performs exceptionally well in\u0000modeling spectral information.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating Training Objectives for Generative Speech Enhancement","authors":"Julius Richter, Danilo de Oliveira, Timo Gerkmann","doi":"arxiv-2409.10753","DOIUrl":"https://doi.org/arxiv-2409.10753","url":null,"abstract":"Generative speech enhancement has recently shown promising advancements in\u0000improving speech quality in noisy environments. Multiple diffusion-based\u0000frameworks exist, each employing distinct training objectives and learning\u0000techniques. This paper aims at explaining the differences between these\u0000frameworks by focusing our investigation on score-based generative models and\u0000Schr\"odinger bridge. We conduct a series of comprehensive experiments to\u0000compare their performance and highlight differing training behaviors.\u0000Furthermore, we propose a novel perceptual loss function tailored for the\u0000Schr\"odinger bridge framework, demonstrating enhanced performance and improved\u0000perceptual quality of the enhanced speech signals. All experimental code and\u0000pre-trained models are publicly available to facilitate further research and\u0000development in this.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"oboVox Far Field Speaker Recognition: A Novel Data Augmentation Approach with Pretrained Models","authors":"Muhammad Sudipto Siam Dip, Md Anik Hasan, Sapnil Sarker Bipro, Md Abdur Raiyan, Mohammod Abdul Motin","doi":"arxiv-2409.10240","DOIUrl":"https://doi.org/arxiv-2409.10240","url":null,"abstract":"In this study, we address the challenge of speaker recognition using a novel\u0000data augmentation technique of adding noise to enrollment files. This technique\u0000efficiently aligns the sources of test and enrollment files, enhancing\u0000comparability. Various pre-trained models were employed, with the resnet model\u0000achieving the highest DCF of 0.84 and an EER of 13.44. The augmentation\u0000technique notably improved these results to 0.75 DCF and 12.79 EER for the\u0000resnet model. Comparative analysis revealed the superiority of resnet over\u0000models such as ECPA, Mel-spectrogram, Payonnet, and Titanet large. Results,\u0000along with different augmentation schemes, contribute to the success of RoboVox\u0000far-field speaker recognition in this paper","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RF-GML: Reference-Free Generative Machine Listener","authors":"Arijit Biswas, Guanxin Jiang","doi":"arxiv-2409.10210","DOIUrl":"https://doi.org/arxiv-2409.10210","url":null,"abstract":"This paper introduces a novel reference-free (RF) audio quality metric called\u0000the RF-Generative Machine Listener (RF-GML), designed to evaluate coded mono,\u0000stereo, and binaural audio at a 48 kHz sample rate. RF-GML leverages transfer\u0000learning from a state-of-the-art full-reference (FR) Generative Machine\u0000Listener (GML) with minimal architectural modifications. The term \"generative\"\u0000refers to the model's ability to generate an arbitrary number of simulated\u0000listening scores. Unlike existing RF models, RF-GML accurately predicts\u0000subjective quality scores across diverse content types and codecs. Extensive\u0000evaluations demonstrate its superiority in rating unencoded audio and\u0000distinguishing different levels of coding artifacts. RF-GML's performance and\u0000versatility make it a valuable tool for coded audio quality assessment and\u0000monitoring in various applications, all without the need for a reference\u0000signal.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages","authors":"Ming-Hao Hsu, Kuan Po Huang, Hung-yi Lee","doi":"arxiv-2409.10429","DOIUrl":"https://doi.org/arxiv-2409.10429","url":null,"abstract":"This paper presents Meta-Whisper, a novel approach to improve automatic\u0000speech recognition (ASR) for low-resource languages using the Whisper model. By\u0000leveraging Meta In-Context Learning (Meta-ICL) and a k-Nearest Neighbors (KNN)\u0000algorithm for sample selection, Meta-Whisper enhances Whisper's ability to\u0000recognize speech in unfamiliar languages without extensive fine-tuning.\u0000Experiments on the ML-SUPERB dataset show that Meta-Whisper significantly\u0000reduces the Character Error Rate (CER) for low-resource languages compared to\u0000the original Whisper model. This method offers a promising solution for\u0000developing more adaptable multilingual ASR systems, particularly for languages\u0000with limited resources.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems","authors":"Hitesh Tulsiani, David M. Chan, Shalini Ghosh, Garima Lalwani, Prabhat Pandey, Ankish Bansal, Sri Garimella, Ariya Rastrow, Björn Hoffmeister","doi":"arxiv-2409.10515","DOIUrl":"https://doi.org/arxiv-2409.10515","url":null,"abstract":"Dialog systems, such as voice assistants, are expected to engage with users\u0000in complex, evolving conversations. Unfortunately, traditional automatic speech\u0000recognition (ASR) systems deployed in such applications are usually trained to\u0000recognize each turn independently and lack the ability to adapt to the\u0000conversational context or incorporate user feedback. In this work, we introduce\u0000a general framework for ASR in dialog systems that can go beyond learning from\u0000single-turn utterances and learn over time how to adapt to both explicit\u0000supervision and implicit user feedback present in multi-turn conversations. We\u0000accomplish that by leveraging advances in student-teacher learning and\u0000context-aware dialog processing, and designing contrastive self-supervision\u0000approaches with Ohm, a new online hard-negative mining approach. We show that\u0000leveraging our new framework compared to traditional training leads to relative\u0000WER reductions of close to 10% in real-world dialog systems, and up to 26% on\u0000public synthetic data.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}